Commonality and variation in mental representations of music revealed by a cross-cultural comparison of rhythm priors in 15 countries



Procedure

Informed consent

All participants provided informed consent in accordance with the Ethics Council of the Max Planck Society (protocols 2017_12 and 2020_11), the Columbia University Institutional Review Board (protocol IRB-AAAR3726), the University of Western Ontario Health Science Research Ethics Board (protocol 108477), the KAIST Institutional Review Board (protocol IRB-KH2017-15), Durham University (Music Department Ethics Committee, February 2018), the Bogazici University Social Sciences Human Research Ethics Committee (protocol SBB-EAK 2017/1) and the Massachusetts Institute of Technology Committee on the Use of Humans as Experimental Subjects (protocol 1209005242R006). All participants received compensation for their involvement, the amount of which was consistent with the minimum wage regulations of their respective countries. We obtained verbal consent to publish images of participants and musicians.

Overview of procedure

The experiment measured iterated reproduction (sometimes referred to as serial reproduction or iterated learning38,39) of rhythms. The participants were instructed to synchronize their finger tapping to a repeating auditory stimulus presented over headphones. In previous work26 we found that synchronization to an ongoing rhythm produced similar results to an alternative task in which participants heard a pattern and then tapped a reproduction from memory. However, synchronization proved easier to explain to participants, so we used it for this cross-cultural study. The participants first completed a short training session (about 10 minutes long) that familiarized them with the apparatus and task (described below). The main experiment consisted of a series of trials, each of which contained five iterations.

On each trial we sampled a random seed uniformly from the triangular rhythm space, corresponding to a three-interval rhythm (s1, s2, s3). We then generated a sequence of clicks from the seed by repeating the three-interval seed pattern ten times. After a few clicks (typically a bit more than one cycle), participants began to synchronize to the click sequence (‘paced’ tapping). A MATLAB script (MATLAB 2018a) extracted response onsets from an audio recording of the participant’s taps (see ‘Onset extraction’ below). We averaged the inter-response intervals across the ten repetitions, obtaining an average three-interval response (r1, r2, r3). Taps were not always detected at every stimulus onset, both because participants sometimes failed to produce a tap in response to a stimulus click, and because produced taps were not detected with 100% reliability. We thus allowed some missing taps within each repeated three-interval pattern but required there to be ‘enough’ taps to estimate an average response. Specifically, we required that each of the three stimulus onsets within the pattern be associated with a tap onset in at least three of the ten repetitions. To obtain the average response, we replaced missing taps with the corresponding average onset time for the taps that were detected. We further required that the response (r1, r2, r3) was not situated far beyond the region we defined for human-producible rhythms (defined as not containing an interval shorter than 285 ms).

If the iteration satisfied these two criteria, we set the seed pattern for the subsequent iteration to the response pattern: (s1, s2, s3) ← (r1, r2, r3). If the iteration was invalid, the data from that iteration were omitted from analysis and the seed remained unchanged. We repeated this process five times. If there were three invalid iterations within a trial, the trial was stopped, and a new trial with a new seed was started (such failed trials typically were due to a rhythm being too difficult for a participant to reproduce, and this procedure was intended to minimize a participant’s frustration). For trials with two or fewer invalid iterations, the nth iteration was analysed as the nth iteration even if the iteration that preceded it was invalid. There was a fixed interval of approximately 4 s between iterations within a trial and a fixed interval of approximately 9 s between trials (both varied by up to 200 ms in either direction due to slight variation in computer systems across sites).
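The validity checks and seed update described above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code: the function names, the data layout (tap times given relative to the start of each repetition, with `None` for a missing tap) and the way the third interval closes the cycle using the stimulus period are all assumptions made for the example. Because missing taps are replaced by the mean of the detected taps, averaging the intervals reduces to taking differences of the mean tap positions.

```python
from statistics import mean

MIN_INTERVAL_MS = 285.0  # shortest interval allowed in a response (from the text)
MIN_DETECTED = 3         # each onset must be tapped in at least 3 of the 10 repetitions


def average_response(taps, period_ms):
    """Return the averaged response (r1, r2, r3), or None if the iteration is invalid.

    taps: one entry per repetition, each a list of three tap times in ms
    relative to the start of that repetition (None marks a missing tap).
    """
    positions = []
    for k in range(3):
        detected = [rep[k] for rep in taps if rep[k] is not None]
        if len(detected) < MIN_DETECTED:
            return None  # criterion 1: too few taps for this onset
        positions.append(mean(detected))
    p1, p2, p3 = positions
    # Assumption: the cycle closes at the stimulus period (synchronized tapping).
    response = (p2 - p1, p3 - p2, period_ms - (p3 - p1))
    if min(response) < MIN_INTERVAL_MS:
        return None  # criterion 2: outside the producible region
    return response


def run_trial(seed, collect_taps, n_iterations=5, max_invalid=3):
    """Run one trial; collect_taps(seed) is a hypothetical recording callback.

    Returns the list of per-iteration responses (None for invalid iterations),
    or None if the trial was stopped after max_invalid invalid iterations.
    """
    history, invalid = [], 0
    for _ in range(n_iterations):
        response = average_response(collect_taps(seed), sum(seed))
        if response is None:
            invalid += 1
            history.append(None)  # seed stays unchanged
            if invalid >= max_invalid:
                return None
        else:
            seed = response  # (s1, s2, s3) <- (r1, r2, r3)
            history.append(response)
    return history
```

Invalid iterations keep their position in the returned list, which mirrors the rule that the nth iteration is analysed as the nth iteration even when an earlier one was invalid.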

The number of trials that could be run in an experimental session varied depending on the location and is reported in Supplementary Table 2. In 13 locations we also performed an additional experiment with a faster tempo (a pattern duration of 1,000 ms). This additional experiment was always performed after the main experiment (with a pattern duration of 2,000 ms). The demographic information for this experiment is provided in Supplementary Tables 1 and 3.

Apparatus

Data were collected with between one and four computerized stations for each testing location. Each station included a computer, a Focusrite Scarlett 2i2 USB sound card, two sets of Sennheiser HD 280 Pro headphones and a tapping sensor (Fig. 2). This design is identical to that reported in the paper by Jacoby and McDermott that introduced the experiment paradigm26. Each sensor contained a microphone embedded in sound isolation materials and covered with a soft cloth to muffle impact sounds as much as possible. The microphone in each sensor was highly sensitive, and light touches generated bursts of noise that were recorded by the microphone. The sound card simultaneously recorded the microphone output and the audio stimulus played out by the participant’s headphones, so that the latency of the response recording relative to the stimulus was nearly eliminated (less than 1 ms). The specification of the hardware and instructions for building the sensor are provided in the Open Science Framework (OSF) repository associated with this paper (see ‘Data availability’).

Stimuli

The stimulus on each trial was a rhythmic pattern composed of short percussive sounds (bursts of white noise) 65 ms long with an attack time of 5 ms (linear ramp) and decaying gradually over the remaining 60 ms, with the envelope hand-designed to mimic common percussion instruments (the decay amplitude decreased exponentially to 10% of the maximum over the first 55 ms, then decayed exponentially at a faster rate in the next 5 ms and was then truncated). These patterns contained ten repetitions of a particular three-interval rhythm. The stimuli were identical to those used in Jacoby and McDermott26. Software to replicate the experiment and sound material can be found in the OSF repository associated with this paper (see ‘Data availability’).
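As a rough illustration of the envelope described above, the sketch below builds a 65 ms amplitude envelope and applies it to white noise. The sample rate, the function names and the rate of the final fast decay (`fast_factor`) are assumptions, since the text does not specify them; the actual stimuli are available in the OSF repository.

```python
import math
import random


def click_envelope(sample_rate=48000, fast_factor=10.0):
    """Amplitude envelope: 5 ms linear attack, 55 ms slow decay, 5 ms fast decay."""
    n_attack = sample_rate * 5 // 1000
    n_slow = sample_rate * 55 // 1000
    n_fast = sample_rate * 5 // 1000
    slow_rate = math.log(10) / 0.055  # amplitude reaches 10% after 55 ms
    env = [i / n_attack for i in range(n_attack)]
    env += [math.exp(-slow_rate * i / sample_rate) for i in range(n_slow)]
    # Assumed: the last 5 ms decay fast_factor times faster, then the sound is truncated.
    env += [0.1 * math.exp(-fast_factor * slow_rate * i / sample_rate)
            for i in range(n_fast)]
    return env


def noise_burst(sample_rate=48000, rng=random):
    """White noise shaped by the envelope (samples in the range -1 to 1)."""
    return [e * rng.uniform(-1.0, 1.0) for e in click_envelope(sample_rate)]
```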

Three-interval rhythms

Each stimulus was defined by a pattern of three intervals (s1, s2, s3) constrained such that the overall pattern duration was 2,000 ms: s = s1 + s2 + s3 = 2,000 ms. In addition, to avoid rhythms that were too fast for humans to reproduce, we constrained the initial seeds so that the smallest interval was larger than 300 ms (s1, s2, s3 > 300 ms). We then repeated the interval pattern ten times, thereby forming a sequence of 30 intervals \({\{{S}_{i}\}}_{1\le i\le 30}=({s}_{1},{s}_{2},{s}_{3},{s}_{1},{s}_{2},{s}_{3},{s}_{1},\ldots ,{s}_{3})\). From this sequence, we created a sequence of 31 onsets \({\{{O}_{t}\}}_{0\le t\le 30}\) with intervals corresponding to S. The 31 onsets (‘clicks’) were defined with respect to the initial onset \({O}_{0}\): \({O}_{t}={O}_{0}+{\sum }_{1\le i\le t}{S}_{i}\) for 1 ≤ t ≤ 30.
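The seed constraint and the onset construction can be sketched as below. The text does not state how seeds were drawn, so the rejection sampling used here (draw uniformly from the full simplex, keep the draw only if every interval exceeds 300 ms) is an assumption, as are the function names.

```python
import random
from itertools import accumulate


def sample_seed(total_ms=2000.0, min_ms=300.0, rng=random):
    """Draw (s1, s2, s3) uniformly from the region where all intervals exceed min_ms."""
    while True:
        a, b = sorted((rng.random(), rng.random()))
        seed = (a * total_ms, (b - a) * total_ms, (1.0 - b) * total_ms)
        if min(seed) > min_ms:
            return seed


def click_onsets(seed, repetitions=10, first_onset_ms=0.0):
    """Onset times O_0 ... O_30 for the repeated three-interval pattern."""
    intervals = list(seed) * repetitions  # the 30-interval sequence S
    return [first_onset_ms] + [first_onset_ms + c for c in accumulate(intervals)]
```

For the seed (500, 750, 750), `click_onsets` returns 31 onsets, with the pattern restarting every 2,000 ms.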

Projection to the rhythm triangle

We mapped a three-interval rhythm with inter-onset intervals (s1, s2, s3) to a point in a triangular rhythm space spanning all linear combinations of three extremal rhythms: \(\frac{{s}_{1}}{s}{\vec{P}}_{1}+\frac{{s}_{2}}{s}{\vec{P}}_{2}+\frac{{s}_{3}}{s}{\vec{P}}_{3}\), where s = s1 + s2 + s3, and where \({\vec{P}}_{i}\) are the vertices of the triangle (a simplex) (Fig. 1b; see also refs. 9,26). For visualization, we used an equilateral triangle, with \({\vec{P}}_{1}=\left(0,0\right)\), \({\vec{P}}_{2}=\left(1,0\right)\) and \({\vec{P}}_{3}=\left(\frac{1}{2},\frac{\sqrt{3}}{2}\right)\). Note that since the initial seed intervals satisfied the constraint s1, s2, s3 > 300 ms, the initial seeds were located within an inner triangular region with vertices \(\left(\frac{3}{2}f,\frac{\sqrt{3}}{2}f\right)\), \(\left(1-\frac{3}{2}f,\frac{\sqrt{3}}{2}f\right)\) and \(\left(\frac{1}{2},\frac{\sqrt{3}}{2}(1-2f)\right)\), where f = 300/2,000 = 0.15 (Fig. 1b, inner region).
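The projection is a barycentric combination of the triangle vertices, which the following sketch makes explicit. The function names are illustrative; the second function simply evaluates the inner-region vertex formulas so that they can be compared with the projection of the three extreme seeds.

```python
import math

P1, P2, P3 = (0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)


def to_triangle(s1, s2, s3):
    """Map a three-interval rhythm to (x, y) in the equilateral rhythm triangle."""
    s = s1 + s2 + s3
    w1, w2, w3 = s1 / s, s2 / s, s3 / s
    x = w1 * P1[0] + w2 * P2[0] + w3 * P3[0]
    y = w1 * P1[1] + w2 * P2[1] + w3 * P3[1]
    return x, y


def inner_vertices(f=0.15):
    """Vertices of the inner seed region for a minimum interval fraction f."""
    h = math.sqrt(3) / 2
    return [(1.5 * f, h * f), (1 - 1.5 * f, h * f), (0.5, h * (1 - 2 * f))]
```

With f = 0.15, projecting the rhythms (1400, 300, 300), (300, 1400, 300) and (300, 300, 1400) reproduces the three inner vertices in that order.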

Onset extraction

We processed the microphone recording from each trial in non-overlapping windows of 15 s, detecting all samples exceeding a relative threshold of 1.45% of the maximal power of the recorded waveform in the window. This threshold was slightly more sensitive than the 2.25% threshold used in Jacoby and McDermott26; this change was made to accommodate a small number of participants (less than 3%) who tended to produce very light taps. In most cases, onsets were detected with minimal errors (as evaluated by comparing the detected onsets to what was audible from listening to example trials). We nevertheless took several steps to ensure that the detected onsets corresponded to actual taps. First, we discarded onsets that were too close to one another (less than 80 ms apart), as humans generally cannot produce two taps in such close proximity (see Repp3 for a justification of this threshold). Second, we discarded responses that were too far from any stimulus click, regarding them as errors. Here we took into account an important characteristic of human tapping known as ‘negative mean asynchrony’—namely, that tapping in time to a beat tends to be biased compared with the beat onset (typically occurring before the beat)3. We first matched each onset to its closest stimulus click and computed the mean asynchrony m as the average difference between a response and its corresponding stimulus click. We then excluded all events such that |(OR − m) − OS| > 150 ms where OS and OR are the stimulus and response onsets, respectively (namely, a window of 300 ms centred around the perceptual centre defined by the mean asynchrony) from further analysis. In addition, for the analysis, we included only points inside the triangle (see below). This resulted in 99,189 of 2,418,284 tapped responses being excluded from the main experiment (4.1%).

Apart from the change in the detection threshold mentioned above, this procedure was identical to the one reported by Jacoby and McDermott26. The code for the procedure is provided in the OSF repository associated with this project.
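The two cleaning steps (minimum spacing between onsets and the asynchrony window) can be sketched as follows. This is an illustration under stated assumptions rather than the code in the OSF repository: when two onsets are closer than 80 ms the sketch keeps the earlier one, which the text does not specify, and the function name is hypothetical.

```python
def clean_taps(taps_ms, clicks_ms, min_gap_ms=80.0, window_ms=150.0):
    """Return (kept tap/click pairs, mean asynchrony m) after both cleaning steps."""
    # Step 1: drop onsets that follow the previously kept onset by less than 80 ms
    # (assumption: the earlier onset of a close pair is the one retained).
    kept = []
    for t in sorted(taps_ms):
        if not kept or t - kept[-1] >= min_gap_ms:
            kept.append(t)
    if not kept:
        return [], 0.0
    # Step 2: match each onset to its closest stimulus click.
    matched = [(t, min(clicks_ms, key=lambda c: abs(t - c))) for t in kept]
    # Mean asynchrony m: average of (response - stimulus).
    m = sum(t - c for t, c in matched) / len(matched)
    # Exclude events with |(O_R - m) - O_S| > 150 ms.
    pairs = [(t, c) for t, c in matched if abs((t - m) - c) <= window_ms]
    return pairs, m
```

Note that m is computed before the window is applied, so a few stray onsets shift the centre of the window slightly; this follows the order of operations given in the text.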

Procedure for the experimental session

The participants were asked to position the sensor in a comfortable way (see the OSF repository for the full instructions). In most cases, participants preferred to sit on a chair, positioning the sensor on their lap and using their entire hand or one finger for tapping. However, we allowed for different postures in different groups. For example, some participants in Mali preferred to sit on the floor with their legs stretched out in front of them. In all cases, the participants were encouraged to change posture or switch hands if they were fatigued; however, they were not permitted to tap with two hands simultaneously or to tap with two alternating fingers. In cases where participants nevertheless tried to do so, the experimenter would stop the experiment and repeat the instructions. The experiment was conducted using a fixed set of steps:

  1. The first step of the experiment was to ask the participant to tap ‘a steady beat’ at any tempo that they liked. The idea of this task was to familiarize the participant with the sensor, as well as to characterize the spontaneous motor tempo of the participants103. This task was generally easy for participants, but the concept of ‘steady beat’ varied slightly across participant groups. For some participants in Mali and Uruguay, a ‘steady beat’ was a non-isochronous rhythmic pattern rather than an isochronous beat. In cases where participants continued to tap a rhythmic pattern rather than an isochronous beat, the experimenter repeated the instruction. In some cases, participants did not change their behaviour according to the repeated instructions and continued to tap a non-isochronous pattern; in these cases, we did not stop the experiment again and instead proceeded to the next step and recorded their performance as is.

  2. The participants were then asked to ‘tap faster’.

  3. Next, the participants were instructed to tap ‘as fast as they can’ for a couple of seconds. This step aimed to test that the participant did not have any severe motor constraint that would limit them in performing the task. After the participant tapped for about 3–6 s, the experimenter thanked the participant and stopped their tapping to avoid fatigue.

  4. The next step was to familiarize the participant with the stimulus. The experimenter played a few isochronous clicks with an inter-stimulus interval (ISI) of 800 ms over the headphones and asked the participant to report if the sound was too loud or too soft. The level of the sound was then adjusted until the participant felt that the sound was at a comfortable level.

  5. The participants were then instructed to tap along to an isochronous beat (with an ISI of 800 ms). This step could be repeated multiple times until the experimenter felt that the participant understood the task and was able to provide a synchronized response. Note that this was not easy for all participants; for example, some participants naturally tapped in antiphase (‘off-beat’; halfway in between the beat). In case of difficulties, the experimenter would first check that this was not caused by the participant’s posture, in which case the experimenter would suggest that the participant change their position to enable easier tapping. In other cases, the experimenter would demonstrate synchronous tapping to the participant. These additional steps allowed nearly all participants to successfully perform isochronous synchronous tapping. The rare participants who could not successfully perform isochronous tapping in this setting did not continue to the main experiment.

  6. We then performed an isochronous tapping task at the same rate as in the familiarization phase (ISI = 800 ms) in which the participants tapped to a sequence of 56 clicks lasting 44 s.

  7. We next performed an additional isochronous tapping task at a faster rate (ISI = 600 ms) with 75 clicks lasting 45 s.

  8. Finally, we performed a tempo-changing tapping task in which the ISI alternated between 546 and 654 ms every 8–13 clicks (chosen pseudorandomly), with a total of 74 clicks lasting 45 s (the exact tapping sequence, identical for all participants, can be found in the OSF repository).

  9. In the next step of the experiment, the participants were told that they would now tap a rhythm. The experimenter emphasized that, as before, the participant needed to ‘tap once for every click that they hear’. The participants were given five trials of random three-interval rhythms (each with five iterations) to get them familiarized with the task. The experimenter provided feedback to the participant only if they were performing the task in a qualitatively incorrect way, such as tapping on the off-beats or omitting a beat.

  10. The participants performed 10–30 trials (mean, 22.0; s.d., 6.6) of the tapping experiment (each containing five iterations, where each iteration contained ten repetitions of the three-interval pattern, as described above). Due to the long duration, the participants were informed that they could ask for a short break at any time, and the experimenter included additional breaks at various times during the experiment.

  11. The participants answered a set of demographic questions (see the OSF repository for the full list) during one of the breaks and/or at the beginning or end of the experiment.

  12. In some locations, we performed an additional experiment after the completion of the main experiment, in which we repeated the main experiment with an overall pattern duration of 1,000 ms.

Procedures for replicability across sites

Testing stations were created by local research team members according to a set of specifications (see the OSF repository) or created according to the same specifications by N.J. and sent to teams in different locations. A written procedure describing the process of hardware preparation, software installation, task instruction and participant training was delivered to each group (see the OSF repository for the details). The task instructions were translated into local languages. The experimenters were either highly fluent in the local language or accompanied by a translator who was a native speaker. In most locations, the data collection team included an anthropologist or an ethnomusicologist with an expert understanding of the local culture, social groups and music. To ensure that the same procedures were used across sites, all teams were walked through the procedures by N.J., either directly or via video conferencing. Pilot data were collected and analysed at each site, and N.J. inspected the quality of the data before the collection of additional data. To assist this process, the MATLAB script generated images with a small file size (about 150 KB) that summarized the main statistics of data collection (the validity of the trial, the mean and standard deviation of tapping asynchrony (indicative of tapping accuracy) and plots showing the microphone recording levels). The same script also generated small binary files (about 4 KB) with summaries of the data (onset times for stimuli and responses). These files were sent to N.J. via low-bandwidth internet from remote data collection sites, which assisted in troubleshooting technical and data collection errors.

Testing conditions

Experiments in the United States (Boston and New York City) were run in sound-attenuating booths. Elsewhere, where possible, the experiments were run indoors in rooms without other activities (Brazil, Uruguay, the United Kingdom, Sweden, Bulgaria, Turkey, the Bamako site in Mali, India, South Korea and Japan). When run outdoors, the experimenter chose a location away from community activities that was relatively free of distractions and noise (Bolivia, the Sagele site in Mali, Botswana, Namibia and China).

Demographic questionnaires

We employed a demographic questionnaire to characterize musical experience, dance experience and basic demographic information (age, gender, education and spoken languages). We used a baseline demographic questionnaire (see the OSF repository) that was translated and adapted to different languages and participant groups by the researchers. There was some customization of the questions based on their relevance to the local culture. In each location, we consulted with ethnographers and translators regarding the relevance and translations of each demographic item.

Online measurement of tapped responses

To run the experiment online, we used REPP104, a software package for measuring sensorimotor synchronization in online experiments that works efficiently using hardware and software available to most online participants. To achieve temporal accuracy superior to that obtained by a web browser (which would have been inadequate for our experiment105,106), the software plays the audio stimulus through the participant’s laptop speakers and records the original signal simultaneously with the participant’s responses (which they supply by tapping on the laptop case) using the built-in laptop microphone. The resulting recording is then analysed to extract the participant’s taps. The method has been validated in a series of calibration and behavioural experiments104, and it achieves high temporal accuracy (latency and jitter within 2 ms on average). In addition, the method has been shown to provide results that are consistent with those obtained in a laboratory set-up (for example, for isochronous tapping, the lab–online correlation for the tapping precision of individual participants was measured to be \({r}_{19}=0.94\); P < 0.001; 95% CI, (0.85, 0.98); see Experiment 2 in ref. 104).

To measure stimulus-coordinated tapping by online participants, we made some modifications to the original paradigm. One set of differences involved the stimulus. First, because the recorded audio contains both the stimulus and the response, we filtered the stimulus with a high-pass filter (cut-off frequency of 1,000 Hz) to avoid overlap with the frequency range typically occupied by tapping responses (80–500 Hz). Next, we added three custom audio ‘markers’ with known temporal locations at the beginning and end of each stimulus (six in total). These markers enabled us to unambiguously identify the positions of the stimulus onsets in the recorded audio and facilitated precise measurement of participants’ asynchronies. The markers were designed to be robustly detected across a variety of hardware and software set-ups, including cases of noise-cancellation technologies and ambient room noise. The marker sounds were generated from 15 ms bursts of bandpass-filtered white noise in the range of 200–340 Hz, to which we applied linear ramps at the onset and offset (2 ms long). We chose very short intervals between the markers (280 ms for the first interval and 230 ms for the second) to avoid participants confusing the markers with the repeated rhythm (the rhythm pulses and the markers also differed substantially in timbre due to the different frequency ranges).

A second set of differences involved the response recording. The online experiments used free-field recording whereby the audio stimulus is played through the laptop speakers and simultaneously recorded along with the participant’s tapping response using the built-in laptop microphone. This returns an audio file in which the audio stimulus and the participant’s tapping are superimposed. To separate this recording into its stimulus and response components, we used bandpass filters. Since most of the energy in the tapping signal occurs at low frequencies, filtering the recording around the tapping range (80–500 Hz) isolated the tapping response with a high degree of efficacy. We were also able to isolate the markers by filtering in their frequency range (200–340 Hz). In addition, we applied a filter in the 100–170 Hz range, the output of which was used for calibration. Because the markers had no energy in this range, this helped determine the noise level in the recording, which we used to adaptively set the marker detection thresholds. This allowed us to reliably estimate the marker locations even in very noisy or low-quality recordings, which are characteristic of some laptop models and brands.

The detected stimulus markers were used to estimate and compensate for the latency of the recording and to estimate the jitter in the recording. This enabled us to monitor the timing accuracy of each individual trial, crucial to ensuring that timing accuracy remained high in all trials. See Experiment 1 in Anglada-Tort et al.104 for the full details of the calibration experiments and their validation.

After the online-specific pre-processing to isolate the tapped response and compensate for the recording latency, we used the same pipeline as in the main experiment to align the tap onsets and perform the tapping analysis.

Online iterated reproduction procedure

To meet the challenges of online data collection, such as poor control over participants’ hardware and software and a higher risk of fraudulent responders, we made two minor changes to the iterated reproduction procedure.

  1. In the online version of the experiment, the analysis of the recording could take a few seconds to complete (for example, for uploading the audio recording, performing the signal processing and synthesizing the new stimulus). To avoid unnecessary wait times, we did not run consecutive iterated reproductions of the same rhythm seed, as we did in the laboratory. Instead, we ran six ‘chains’ of iterated reproductions in parallel. On each trial of the online experiment, the participants performed a single iteration of a chain (that is, tapping to a single ten-repetition rhythm, which was either a seed rhythm or the result of the reproduction from the preceding iteration of the chain). Each trial was randomly assigned to one of the chains that was not used for the previous trial. Each participant completed all six chains during an experimental block. In Jacoby and McDermott (Experiment 4)26, we showed that the results of this parallel chain procedure do not differ substantially from those of the original paradigm, where the iterations from different chains are not intermixed.

  2. The change in trial order also necessitated a change in the failing criteria: if participants failed a trial, they repeated the trial with a maximum of ten possible additional attempts, but these repeated trials were randomly drawn from the six chains (the ten allowed failed trials were tallied globally within the block). In those cases where this limit was exceeded, the experiment was terminated.
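The interleaving of chains described in the two points above can be sketched as a simple scheduler. This is a hypothetical illustration: the number of iterations per chain (five, as in the in-person experiment) and the fallback used when only the previously used chain still has iterations left are assumptions, and handling of failed trials is omitted.

```python
import random


def chain_order(n_chains=6, iterations_per_chain=5, rng=random):
    """Return the chain index used on each trial of one block."""
    remaining = {c: iterations_per_chain for c in range(n_chains)}
    order, previous = [], None
    while remaining:
        # Prefer any unfinished chain other than the one used on the previous trial.
        candidates = [c for c in remaining if c != previous]
        if not candidates:
            # Assumed fallback: only the previous chain is left, so reuse it.
            candidates = list(remaining)
        chain = rng.choice(candidates)
        order.append(chain)
        remaining[chain] -= 1
        if remaining[chain] == 0:
            del remaining[chain]
        previous = chain
    return order
```

Because consecutive trials normally come from different chains, the processing of one chain's recording can overlap with the participant tapping to another chain.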

In addition, the following seven modifications were made to the overall procedure to ensure that the online participants met the technical requirements for the online experiment and were able to provide good tapping data consistently throughout the experiment. The different steps in this procedure have been extensively piloted and optimized to ensure high data quality when collecting tapping data in online settings104.

  1. First, the participants were instructed that the experiment could only be performed using the laptop speakers and that they should unplug any headphones/earphones or disconnect any wireless devices. They were also instructed to remain in a quiet environment.

  2. The participants were then asked to set the volume of their speakers to a level that was sufficiently high to be detected by the microphone. A sound meter was used to visually indicate when the level was appropriate.

  3. After the volume test, the participants completed a short recording test to detect hardware and software that did not meet the technical requirements of the experiment, such as malfunctioning speakers or microphones. The recording test played a test stimulus with six marker sounds. The markers were recorded with the laptop’s microphone and analysed using our signal processing pipeline. During the playback, the participants were supposed to remain silent. There were a total of three such test recording trials, and we provided feedback after the first trial based on the recording quality: if the markers were not recorded (for example, this could occur if the participant forgot to unplug their headphones), we reminded the participants that they needed to unplug any headphones. If, despite these reminders, marker sounds could not be detected in two of the three trials, the participant was excluded from the experiment. Note that this process also serves as a basic test of task compliance, as the participants must follow the instructions (for example, accept the enabling of the microphone in the browser, unplug any headphones or wireless devices and adjust the volume of the computer) to pass the test. Participants who did not satisfy the technical conditions or who abandoned the experiment at this stage were excluded (747 of 1,303). This relatively high percentage of participants that did not satisfy the technical inclusion criteria is consistent with previous online tapping experiments104.

  4. Participants who passed the recording test were then directed to a tapping calibration test. Here, the participants were asked to tap on the surface of their laptops with their index finger to test whether the microphone could detect their taps, using a sound meter to visually provide feedback. In cases where the signal was too low, the participants were asked to tap on different locations of the laptop or to try to tap harder.

  5. Next, the participants performed a practice phase to acquaint themselves with the main tapping task. The practice phase consisted of four trials using the stimulus sampling procedure of the main experiment (that is, three-interval rhythms randomly sampled from the triangular simplex with a fixed duration of 2,000 ms and repeated ten times). In the first two trials, we provided feedback to the participants based on their recording quality and tapping performance. We used the remaining two trials to exclude participants who were still unable to provide good tapping data, as assessed by failing in one or more of these two trials. A trial was considered a failure if we could not detect all marker sounds, or if the detected markers were displaced relative to each other by more than 15 ms, or if the percentage of detected taps (that is, the number of detected tapping onsets out of the total number of stimulus onsets) was less than 50% or more than 200%. Note that none of these criteria involve participants’ accuracy in replicating the target rhythm; they only reflect whether the signal could be correctly recorded and processed, and whether the participants produced a minimally/maximally acceptable number of tapping responses. An additional 358 participants were excluded on this basis.

  6. Participants who passed the practice phase were then able to start the main tapping task, which used the same procedure described for the in-person experiments except for the modifications mentioned above.

  7. As mentioned above, a main difference between the in-person and online experiments was that the latter consisted of shorter and more flexible experimental sessions. Namely, the experiment was divided into blocks of six chains each. After completing one block, the participants could decide whether to continue with the next block or to end the experiment. There was a maximum of three blocks per session (18 chains). The motivation behind this design choice was to keep online experimental sessions engaging and short, always allowing the participants to decide whether to complete more trials or not.
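The practice-trial failure criteria listed in step 5 can be expressed as a single check. This sketch is not the REPP implementation: the function and argument names are hypothetical, and ‘displaced relative to each other’ is interpreted here as the spread of the offsets between detected and expected marker times, which is an assumption.

```python
def trial_failed(detected_ms, expected_ms, n_taps, n_stimulus_onsets,
                 max_displacement_ms=15.0, min_ratio=0.5, max_ratio=2.0):
    """Return True if a trial fails the recording-quality criteria."""
    # Criterion 1: every marker sound must be detected.
    if not detected_ms or len(detected_ms) != len(expected_ms):
        return True
    # Criterion 2 (assumed interpretation): a constant latency is fine, but the
    # offsets of the individual markers may not differ by more than 15 ms.
    offsets = [d - e for d, e in zip(detected_ms, expected_ms)]
    if max(offsets) - min(offsets) > max_displacement_ms:
        return True
    # Criterion 3: detected taps must be 50-200% of the stimulus onsets.
    ratio = n_taps / n_stimulus_onsets
    return ratio < min_ratio or ratio > max_ratio
```

None of the three checks looks at how accurately the rhythm was reproduced, which matches the point made in the text.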

After completing the first block, the participants answered the same set of demographic questions used in the in-person experiments. We excluded participants who abandoned the experiment prior to its completion or who did not complete the full demographic questionnaire that we administered at the end of the experiment (67 additional participants were excluded on this basis). In total, 131 online participants completed the full experiment and were analysed.

Participants

We tested 39 participant groups spanning five continents and 15 countries (Extended Data Table 1). Overall, we recruited 923 participants (792 were run face-to-face and 131 online) who completed a total of 20,287 trials (seeds) and 2,319,095 individual taps.

Criteria for group selection

Participant groups were chosen to provide a strong test of any potentially universal features of the results. We included groups from both industrialized and non-industrialized societies, as well as groups of local musicians from some non-Western societies (who performed different types of non-Western music). We also tested groups of musicians and dancers where possible, as these populations would be expected to have substantial exposure to particular musical styles. In addition, we tested university students in a number of countries to assess potential effects of exposure to Western culture, which we presumed would be correlated with university attendance. The groups tested were also determined in part by practical constraints (testing time and access to particular populations). Age and gender could not be fully equalized across groups. For example, Malian professional jembe drummers and Uruguayan candombe drummers (the populations recruited for MA-LM and UY-LM) are both relatively small populations (fewer than 50 individuals) of highly skilled professionals who are predominantly male; in each of these two groups, only one participant was female. The substantial experience required for membership in these groups also resulted in these participants being older (Mali: mean age, 40.5 years; s.d., 11.9; Uruguay: mean age, 45.5 years; s.d., 12.8). At the other extreme, dancers in the Sagele village in Mali (MA-DA) were predominantly female.

Sample sizes

We conducted a power analysis using data from US participants collected for a previous publication26. The approach was to try to collect enough trials that the test–retest reliability of the estimated prior for a group was likely to be relatively high (with the goal of having enough data that the results of the experiment would be similar in a hypothetical future replication). The test–retest reliability was estimated using the split-half reliability of our previously collected data following Spearman–Brown correction107. We simulated different amounts of data by subsampling the number of trials used to estimate the prior (resampling without replacement). We found that 250 trials produced a test–retest reliability greater than 0.8, and we thus targeted this number for the sample size of each group. In practice, we often ran more trials if circumstances permitted (between 261 and 948 trials and therefore up to 3.8× the target number of trials; Supplementary Table 2). The only exceptions were the two groups in Botswana, for whom we did not reach this recruitment target because of practical constraints on testing time (170 and 127 trials for the San and Etsha groups, respectively; abbreviated as BW.SA and BW.EA). However, the post hoc reliability of the data collected for these groups was not far below our target value (0.75 and 0.67 for BW.SA and BW.EA, respectively). The post hoc reliabilities of all other groups were high, meeting or exceeding the predicted value of 0.8 (ranging from 0.8 to 0.96; mean, 0.9; s.d., 0.03).
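The Spearman–Brown step can be illustrated with a minimal sketch. This is not the analysis code used for the study: the function names, the use of a Pearson correlation between two half-sample estimates and the example values are all assumptions made for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman_brown(r_half):
    """Predicted full-sample reliability from a split-half correlation."""
    return 2 * r_half / (1 + r_half)

# Hypothetical binned prior estimates from two random halves of the trials.
half_a = [0.10, 0.40, 0.25, 0.15, 0.10]
half_b = [0.12, 0.35, 0.28, 0.13, 0.12]

r_half = pearson(half_a, half_b)
reliability = spearman_brown(r_half)
```

Repeating this computation on random subsets of trials of increasing size (drawn without replacement) would give a reliability curve from which a target trial count can be read off.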

Definition of group types

Students (ST)

We defined students as members of local universities in either undergraduate or graduate programmes.

Musicians (WM and LM)

For brevity, we used the term ‘musicians’ to describe participants with relatively extensive musical experience, acknowledging that musicianship is a concept that changes from place to place47. On the basis of previous work, we defined recruiting criteria that generalize more broadly for different cultural contexts27: (1) professionalism—‘Do you make most or part of your living from music, or did you in the past?’, (2) training—‘Did you undergo music training (such as an apprenticeship or formal study)?’ and (3) public playing—‘Do you perform in public?’ In some locations (the United States, Uruguay, the United Kingdom, Bulgaria, Turkey, Mali and Namibia), all musicians satisfied all criteria, while in other locations (Brazil, Sweden, Botswana, India, Korea and Japan), we also included people who satisfied the last two criteria but not the first one. For all participants, we recorded self-reported years of regularly playing an instrument or singing, as is common in music cognition literature (Supplementary Table 2). We recruited both musicians who play Western classical music (WM) and musicians who play a local musical style that is not Western classical music (LM). We note that for the local musician groups, the nature of the local musical style varied somewhat from group to group; in some cases (most notably for the US.NY-LM group, who played jazz) the musical style was one that had spread globally.

Dancers (DA)

In Bulgaria, we recruited dancers who were members of the same professional ensembles from which we recruited the musician groups. In Mali, we recruited group members of a local recreational dance association that promotes events featuring traditional dance and music.

Non-musicians (NM)

Non-musicians were people who did not satisfy any of the inclusion criteria for the other groups. Their self-reported years of musical experience were substantially less than those of the musician groups in all cases (Supplementary Table 2).

Online (OL)

Online participants were recruited from Amazon Mechanical Turk. Their geographical location was determined by the Amazon qualification system and verified with IP geo-location. Of the 15 countries from which the other participant groups were drawn, we tested online participants from the 3 with substantial Mechanical Turk worker pools (the United States, Brazil and India)108.

Recruiting locations

Basic demographic information for each participant group is provided in Extended Data Table 1. Supplementary Table 2 provides additional information about each group including the number of languages spoken, the languages spoken, years of self-reported musical experience, instruments played and favourite artists or musical genres. We report demographic variables that we were able to reliably measure, and we note that these are not the only variables that varied across groups and that might influence the results. These factors were all based on self-report questionnaires, except the literacy level within the group, which was estimated by the experimenters. Here we provide additional information about each group, ordered according to their geographical location.

United States: Boston—students and Western classical musicians (US.BO-ST and US.BO-WM)

The Boston student participants were students from local universities, recruited using the MIT Brain and Cognitive Sciences Department participant mailing list and through additional online advertisement. All participants were residents of the Boston area, a metropolitan region of New England with over eight million inhabitants. The musician group was recruited from the same departmental mailing list as well as via a social media ad targeting conservatory students from the Boston area. All participants in this musician group had formal training in music. Some of them were professional musicians. Most participants in the musician group played Western classical music, though some also played other styles such as pop and jazz. There was some overlap between these groups and the US participant group in a previous publication26; the groups were not identical due to different exclusion criteria in the two studies.

United States: New York City—non-musicians, Western classical musicians and jazz musicians (US.NY-NM, US.NY-WM and US.NY-LM)

New York participants were recruited by word of mouth, campus advertisements at Columbia University and online advertisements. The participants were residents of the New York City metropolitan area, a densely populated region in the United States with over 18.8 million people. We recruited three groups: non-musicians, musicians specializing in Western classical music (the WM group) and jazz musicians (the LM group). Both musician groups were a mix of music students and professional musicians. All musicians had formal education and training in music.

Bolivia: Tsimane’—non-musicians (BO.TS-NM)

Tsimane’ are an Indigenous people of lowland Bolivia, comprising about 19,000 individuals who live in about 130 small villages mostly along river basins (including the Maniqui River), located in the department of Beni (a subdivision of Bolivia). They subsist mostly on farming, fishing and hunting. Tsimane’ have traditional music, familiarity with which varies across individuals. As reported by Riester109, their traditional songs have characteristic rhythmic patterns. The most common such pattern reported by Riester can be written in ratios as 1:1:2. Traditional Tsimane’ musical culture also once included shamanic practices with drum playing, but these practices are no longer in use110. The region containing Tsimane’ communities is undergoing rapid modernization due to a push by the Bolivian government and non-government organizations to provide services to the Indigenous peoples. Radio usage is now fairly common, and villages near the local town of San Borja tend to have electricity. During the mid-1950s, Protestant missionaries from the United States settled permanently along the river Maniqui to proselytize Tsimane’, setting up the first rural schools for them and teaching them church hymns111. More recently, evangelism has spread Christianity within and across many villages. Thus, in addition to their knowledge of traditional music, nowadays most Tsimane’ villagers are somewhat familiar with religious Christian hymns. These hymns are monophonic and sung in Tsimane’. They are similar to traditional Tsimane’ music in relying on small intervals and a narrow vocal range and are sometimes accompanied by other instruments played by community members. Group singing appears to be rare, irrespective of whether the material is traditional songs or hymns. For the present study, we recruited participants in three Tsimane’ villages. 
Upon arriving at each village, we used a horn or a bell to initiate a community meeting where we introduced the researchers and registered participants for experiment sessions. Two of the villages (Mara and Moseruna) were a two-day walk or a three-hour car ride from San Borja, along a road that was accessible only to high-clearance vehicles and motorcycles if recent weather had been dry. The other village (Yaranda) was located along the Maniqui River and accessible only by a one-day trip on a motorized canoe. All three villages have relatively little communal church singing. The participants had varied musical experience, but none regarded music as a profession. None of the participants had formal training in music. Two participants reported regularly playing an instrument, and 34 participants reported playing an instrument at least once.

Bolivia: San Borja—non-musicians (BO.SB-NM)

San Borja is a small town in the Bolivian department of Beni in the Amazon basin with over 20,000 residents. At the time of data collection, San Borja could be reached by car during dry months of the year but was accessible only by small planes during much of the rainy season. Participants were recruited by word of mouth and had resided for most of their life in San Borja. The participants had not received formal education in music.

Bolivia: Santa Cruz de la Sierra—non-musicians (BO.SC-NM)

Santa Cruz de la Sierra is the largest city in Bolivia, with a population of over 1.8 million people. Participants were recruited using an online advertisement and word of mouth. All participants had been born and raised in the Bolivian department of Santa Cruz, and they resided in the city at the time of the experiment. The participants had not received formal education in music.

Bolivia: La Paz—students and non-musicians (BO.LP-ST and BO.LP-NM)

La Paz is the third-largest city in Bolivia, with a population of over 0.9 million people, located in the Andes. We recruited the student group from local universities. Most student participants were recruited by a student who was a research assistant. For the non-musician group, we recruited blue-collar workers employed by a hotel and their relatives. These participants were mostly from El Alto, a city adjacent to La Paz (with a population of about one million people), and many of them were Indigenous (Aymara or Quechua). Many participants reported experience with traditional music and dance in childhood or as adults. The participants had not received formal education in music.

Brazil—local musicians (BR-LM)

We recruited percussionists in the Recife metropolitan area, a city in Pernambuco, part of Northeast Brazil. The population of the Recife metropolitan area is over four million people. The percussionists practice a local style of music that is part of the Maracatu-nação (translation from Portuguese: ‘nation maracatu’) cultural tradition. This region-specific tradition, usually perceived as Afro-Brazilian, involves religion, music, song, dance and elaborate costumes. Its history is disputed but is most often linked to the colonial coronations of queens and kings among enslaved Africans in Brazil (in the sixteenth to nineteenth centuries). However, the earliest descriptions date only from the beginning of the twentieth century112,113. The music is performed by community-based groups in the slums of the metropolitan area of Recife termed maracatus-nação, approximately 30 of which are currently active. These groups include a costumed dance group and a large percussion ensemble. The groups perform in parades that take place primarily during Carnival. The percussionists we recruited either grew up within one of these community groups or had participated in one for at least several years. The participants had not received formal education in music in a university or conservatory but received substantial training as described above.

Uruguay—students and local musicians (UY-ST and UY-LM)

Participants in Uruguay were all recruited in Montevideo, the capital of Uruguay, with a population of over 1.3 million people. The musicians were performers of Uruguayan candombe drumming59,114. All participants were born and raised in neighbourhoods with a strong tradition of candombe drumming and were ‘native players’, having acquired their competence in the style by direct transmission. The majority of the participants were outstanding players, regarded as master drummers by the community. As a rule, the participants did not have formal musical training in a conservatory or university, although a few of them had some basic knowledge of music theory and could play instruments other than the drum (such as keyboards and bass). However, they had all had substantial practical training in drumming since childhood. The participants were recruited by L.J. and M.R., who are local experts in this musical style and are familiar with the local musicians. The students were members of the local university with no formal musical training but with passive exposure to music, including Uruguayan candombe drumming. They were recruited by two research assistants by word of mouth.

United Kingdom—students and jazz musicians (UK-ST and UK-LM)

Two groups of participants were recruited in North East England and Scotland, an area of the United Kingdom with over eight million people. Most participants were recruited from Durham, a county in England with over 500,000 people. The first group consisted of students from Durham University. The second group consisted of instrumental jazz musicians, comprising a mix of professional musicians and students currently studying music at university (the students were recruited from Durham University and the University of Edinburgh). All participants in this group reported that they perform jazz in public and earn money from performing, and all had formal training in music in a university and/or conservatory. These participants played a range of instruments (including piano, saxophone, guitar, trumpet, drums and double bass), and most reported performing in a range of different sub-genres and groups, most commonly big bands and smaller ensembles (such as trios) playing jazz standards. Some participants reported liking or performing musical traditions from other cultures, including Latin and Balkan-influenced music.

Sweden—local musicians (SE-LM)

The recruitment of musicians in Stockholm (the largest metropolitan area in Sweden, with over 2.4 million people) focused on students and teachers of folk music performance at the Royal College of Music, and on dance students at the School of Dance and Circus at the Stockholm University of the Arts, as the latter also had extensive experience playing music. Among the latter group, we required the participants’ focus to be Swedish folk dance. Of the 22 recruited participants from the Stockholm area, 9 considered themselves mainly as dancers and 13 mainly as musicians, and 1 participant identified as a dancer and musician to the same degree. In addition, 72% of the participants reported making money from performing music or dance, and all but 4 reported performing in public. Independent of this self-categorization, all participants asserted that they dance and have experience with either instrumental or vocal music making.

Bulgaria—local musicians and dancers (BG-LM and BG-DA)

The recruitment of musicians and dancers in Bulgaria focused on members of a type of professional folk ensemble that developed once the country adopted a communist system of government after the Second World War115,116. These ensembles typically consist of an orchestra of folk instruments, a women’s choir and a dance troupe, and they give stage performances of arranged and newly composed Bulgarian folk music and elaborately choreographed folk dances. Most performers in these ensembles have studied folk music performance or choreography in the Bulgarian conservatory system. Some of the dancers who participated in the study belonged to professional or semi-professional dance troupes that perform Bulgarian folk dances with recorded rather than live music. We recruited participants in three Bulgarian cities: Pleven (a city with approximately 90,000 people), Plovdiv (a city with approximately 343,000 people) and Sofia (the capital of Bulgaria and the largest metropolitan area, with over 1.2 million people). The participants were members of the ensembles in these cities and were contacted with permission from the ensemble directors.

Turkey—students and local musicians (TR-ST and TR-LM)

The group of Turkish student participants was recruited from Istanbul Technical University Turkish Music State Conservatory and Bogazici University. The Istanbul metropolitan area has over 15.8 million people. The musician participant group was recruited from the cities of Izmir and Istanbul and included professional musicians who had studied at institutes or conservatories for traditional Turkish music, such as Istanbul Technical University Turkish Music State Conservatory. These musicians were experienced in Turkish folk music or dance through formal training or extensive practice in groups or ensembles. They had experience in various sub-genres of Turkish music, ranging from Aegean to Black Sea regions, also including Sufi music. The musicians were involved in ensembles performing traditional, religious, classical or modern Turkish music, as well as Western music with Eastern influences.

Mali—students, local musicians and dancers (MA-ST, MA-LM and MA-DA)

The group of Malian university students was recruited in the capital city of Bamako and comprised both BA- and MA-level students as well as recent graduates of the University of Bamako. Bamako is the capital of Mali and the largest metropolitan area in the country, with a population of over two million people. Students were recruited by N.D. by word of mouth. The musician group was also recruited in Bamako. Their main performance work occurred at local music and dance events, primarily wedding celebrations, but they also worked in the national and international scenes of staged folk dance and percussion music117. Most musicians did not have training in music from a university or conservatory but instead had substantial training via traditional, practice-based apprenticeships. By contrast, Malian dancers were not active as specialized musicians (and did not have formal education in music or dance) but regularly danced at wedding celebrations in which professional musicians performed, and thus were highly familiar with local styles of music. Only one of the participants reported receiving money from playing or dancing. R.P. (who has more than 30 years of experience working with Malian musicians) recruited the musicians with the help of a research assistant. The dancer participants were recruited on the basis of their membership in a local dance organization in the peasant village of Sagele, approximately 75 km southwest of Bamako. Sagele has a population of approximately 5,000 people.

Botswana: San—local musicians (BW.SA-LM)

The San musician group was recruited in D’Kar, a village ~640 km northwest of Gaborone (the capital of Botswana), with a population of about 1,700 people. The participants were either members of a local organization that performed traditional songs and dances of the San culture (primarily for nearby events or for tourists) or local residents with substantial musical experience (at least ten years of self-reported musical experience) but no formal musical training in a university or conservatory.

Botswana: Etsha—local musicians (BW.EA-LM)

The second Botswanan group was recruited in Etsha, a group of villages ~320 km north of D’Kar, with a population of about 10,000 people. The participants were members of two groups who performed traditional songs, or local residents with substantial musical experience (more than seven self-reported years of musical experience). The participants came from two closely related Okavango Delta region subcultures: Hambukushu and Bayei. Although the Hambukushu and Bayei are culturally distinct, their geographic proximity means that each is exposed to the other’s music, with far less exposure to the music of the San. The participants had no formal musical training from a university or conservatory.

Namibia—non-musicians and local musicians (NM-NM and NM-LM)

Participants in Namibia were recruited from a small local Damara community (population ~500) in the Spitzkoppe region via word of mouth during the week prior to the start of the study. Recruitment and data collection occurred with help from two local research assistants, who also acted as translators. The musician group was drawn from two active musical groups, both of whom had extensive experience with traditional music of the region: a ‘cultural singing group’ (who frequently practice and perform |ais (old traditional folk songs and dances), in addition to other styles of performance) and a ‘youth choir’ (who typically rehearse and perform elob mis, a form of gospel music, sung in Khoekhoegowab (the local language), English and Afrikaans). Both groups perform at local and regional community events (such as weddings, funerals and an annual Damara traditional festival). The singing group intermittently performs for visiting tourists, and the youth choir also performs weekly in church. The first group typically earns money in exchange for these performances, though most members would not be construed as professional performers and did not have formal training from a university or a conservatory. The non-musician participants were members of the local community who were not part of either group. Although almost all participants reported engaging in some form of singing (for example, joining in at weddings or other community events, or simply while listening to the radio), these non-musician participants did not regularly practice or perform with any group, and many had limited or no knowledge of old traditional song lyrics and dances.

India—non-musicians and local musicians (IN-NM and IN-LM)

In India, the experiment was conducted at I.I.T. Bombay in the city of Mumbai (a city with over 20.9 million people). The non-musicians all worked or studied at I.I.T. Bombay and had not had any musical training or substantial exposure to Indian classical music. We classified the group as non-musicians rather than students because only a minority (6 of 15) were students. Most of the musician participants were professional musicians living in the city, but a few were students at I.I.T. Bombay. Most of them were trained in the North Indian (Hindustani) form of art music. Formal training typically involves taking one-on-one lessons from a teacher (the appointed guru) and performing in public concerts. Some of the musician participants were primarily vocalists but also played an instrument, and about half of them were tabla (percussion) players who accompanied vocalists in concerts. All but three musicians reported currently playing in public concerts, and 64% reported at least sometimes receiving money from playing.

South Korea—students, Western classical musicians and local musicians (KR-ST, KR-WM and KR-LM)

The Korean student participant group was recruited from Chungnam National University in Daejeon, a metropolitan area containing over 1.4 million people. The Western musician group consisted of students from the Department of Music at Chungnam National University. These participants had been exposed to both Western music and K-pop music via mass media. The local musician group comprised students studying traditional Korean instruments in the Department of Korean Music at Jeonbuk National University located in Jeonju, a metropolitan area containing over 651,000 people. These participants had trained on Korean traditional instruments for many years but also had extensive exposure to both Western and K-pop music.

Japan—students, Western classical musicians and local musicians (JP-ST, JP-WM and JP-LM)

In Japan, the student and Western musician groups were recruited from Keio University Shonan Fujisawa Campus in Fujisawa, Kanagawa, near Tokyo. Tokyo is the capital of Japan and the metropolitan area with the largest population (over 37.2 million people). Participants in the student group had no formal musical training. The Western musician group had formal music training in Western instruments. The local musician group was recruited from Tokyo University of the Arts (Japan’s leading music conservatory). They all played at public events, and 73% received money for playing music. We recruited students studying traditional Japanese instruments (shamisen, koto, shakuhachi or hayashi instruments from the noh ensemble) in the Department of Traditional Japanese Music. These students had to train on these instruments for many years to qualify for acceptance into the department and can be considered ‘bi-musical’118 in that they had extensive exposure to both popular Western and traditional Japanese musical systems. All Japanese participants had extensive passive exposure to both Western music and Western-influenced Japanese popular music via radio, TV and other media. All Japanese participants were recruited from Tokyo and its surrounding cities.

China—non-musicians (CN-NM)

Participants in China were recruited from a cluster of Dong minority villages in the Guizhou Province in southwestern China. The Guizhou Province contains approximately 38.5 million people, and the Dong villages each have approximately 500 to 2,000 people. Singing features prominently in many village activities, but the most famous and distinctive tradition of Dong song is Dage, or Big Song, recognized by UNESCO in 2009 on their list of humanity’s Intangible Cultural Heritage. There is wide participation throughout the villages in learning and performing Dage, a genre of two-part group polyphonic singing, occasionally metred, where words are central and pitch height and contour carefully follow the lyrics119. Dage is performed informally within people’s homes and more formally within the drum-towers found in many villages119. The participants had no formal musical training from a university or conservatory.

Online participants from the United States, Brazil and India (US-OL, BR-OL and IN-OL)

All online participants were recruited using Amazon Mechanical Turk. In the online advertisement, we required the participants to meet the following five criteria: they had to (1) be at least 18 years old, (2) speak English, (3) use a laptop to complete the experiment (no desktop computers allowed), (4) use an up-to-date Google Chrome browser (due to compatibility with the technology) and (5) sit in a quiet environment, to ensure that their tapping could be recorded precisely. To test participants from the United States, India and Brazil, we used Amazon Mechanical Turk’s qualification system, which allows researchers to recruit participants registered in each location. To ensure that the participants were undertaking the experiment from the registered location, we only included participants whose registered location matched their IP-based geo-location, which we verified using the service ipinfo.io. We estimate that the Mechanical Turk participant pool in the United States has a few thousand unique active participants per week. In India and Brazil, we estimate that the number of unique active participants is about 500 and 200, respectively.

Analysis

Small-integer ratios

Following Jacoby and McDermott26, we considered ‘small-integer-ratio rhythms’ to be those with ratios composed of the integers 1, 2 and 3 that fell within the rhythm triangle. This results in the following 22 unique ratios: Ω22 = {1:1:1, 1:1:2, 1:2:1, 2:1:1, 1:2:2, 2:1:2, 2:2:1, 1:1:3, 1:3:1, 3:1:1, 1:2:3, 2:3:1, 3:1:2, 1:3:2, 2:1:3, 3:2:1, 2:2:3, 2:3:2, 3:2:2, 2:3:3, 3:2:3, 3:3:2}. We also grouped together categories that are equivalent under cyclic permutation, resulting in eight categories: Ω8 = {111, 112, 122, 113, 123, 223, 233, 132}.
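The two sets can be reproduced by direct enumeration. The sketch below assumes that ‘within the rhythm triangle’ means that the smallest interval is at least 15% of the pattern duration (the bound used for the seeds); this reading, and all names in the code, are assumptions for illustration.

```python
from itertools import product
from math import gcd

MIN_FRACTION = 0.15  # assumed bound: smallest interval >= 15% of the pattern

def reduce_ratio(triple):
    """Divide out the common factor so that 2:2:2 and 1:1:1 coincide."""
    g = gcd(gcd(triple[0], triple[1]), triple[2])
    return tuple(v // g for v in triple)

def cyclic_class(triple):
    """Canonical representative of a ratio under cyclic permutation."""
    a, b, c = triple
    return min((a, b, c), (b, c, a), (c, a, b))

# All ratios built from the integers 1, 2 and 3 that satisfy the bound.
ratios = sorted({
    reduce_ratio(t)
    for t in product((1, 2, 3), repeat=3)
    if min(t) / sum(t) >= MIN_FRACTION
})
classes = sorted({cyclic_class(r) for r in ratios})

print(len(ratios))   # 22
print(len(classes))  # 8
```

Under this assumption the only excluded ratios are 1:3:3, 3:1:3 and 3:3:1, whose shortest interval is 1/7 (about 14.3%) of the pattern.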

Kernel density estimates of the prior

The experiment consisted of a number of trials, each of which consisted of iterated reproduction of a random seed rhythm. We estimated a participant’s prior using the data from the fifth and final iteration of each trial, having demonstrated in Jacoby and McDermott26 that five iterations are sufficient for the iterative procedure to converge to the prior (Supplementary Figs. 1, 2 and 7 in Jacoby and McDermott; see Supplementary Fig. 2 of this paper for analyses of convergence in each group)26. Before this analysis, we excluded points outside the inner triangular region defined earlier that was intended to correspond to the region of human-producible rhythms (with vertices \(\left(\,\frac{3}{2}{f},\frac{\sqrt{3}}{2}{f}\,\right),\left(1-\frac{3}{2}{f},\frac{\sqrt{3}}{2}{f}\;\right),\left(\frac{1}{2},\frac{\sqrt{3}}{2}(1-2f)\;\right)\), where f = 300/2,000). The prior was then estimated by adding together Gaussian kernels, with mean μi and covariance Σi empirically computed from the repetitions of the rhythm within the fifth iteration (there were up to ten repetitions depending on the number that the participant correctly produced during the fifth iteration; for repetitions that had missing taps, the missing tap(s) was replaced by the mean onset of the successfully produced taps at that stimulus position). Since this covariance matrix is estimated on the basis of small numbers of samples, we added a regularization term: \({\varSigma }_{i}^{{\prime} }={\varSigma }_{i}+\gamma I\), where I is the identity matrix, and γ = 15 ms (we slightly increased this value compared with the γ = 10 ms of Jacoby and McDermott26 since some participant groups had lower numbers of correctly reproduced repetitions). 
We averaged these kernels (\({G}_{i}\left(x\right) \sim {\mathrm{N}}({\mu }_{i},{\varSigma }_{i}^{{\prime} })\), one per trial) across all completed trials within a participant group, obtaining a distribution \(P(x)=\frac{1}{N}{\sum }_{i=1}^{N}{G}_{i}\left(x\right)\) over the triangle. For statistical analyses, we represented these distributions in bins spanning 0.006 in each dimension of the triangle (that is, 12 ms given the 2,000 ms pattern duration). To generate high-resolution images for the paper figures, we used bins of size 0.003 in each dimension. Supplementary Figs. 3–9 show high-resolution kernel density estimates for all groups.
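The kernel construction can be sketched as follows for two-dimensional points. This is an illustrative sketch rather than the code used for the paper: the function names and example trials are hypothetical, and gamma is expressed in the squared units of whatever coordinates are passed in (how the reported γ maps onto triangle coordinates is not assumed here). Each trial needs at least two repetitions for the sample covariance to be defined.

```python
import math

def mean_and_cov(reps):
    """Sample mean and 2x2 sample covariance of a list of (x, y) points."""
    n = len(reps)
    mx = sum(p[0] for p in reps) / n
    my = sum(p[1] for p in reps) / n
    sxx = sum((p[0] - mx) ** 2 for p in reps) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in reps) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in reps) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def kernel(reps, gamma):
    """Gaussian kernel for one trial, with covariance Sigma + gamma * I."""
    (mx, my), (sxx, sxy, syy) = mean_and_cov(reps)
    sxx, syy = sxx + gamma, syy + gamma
    det = sxx * syy - sxy ** 2
    norm = 1.0 / (2 * math.pi * math.sqrt(det))

    def density(x, y):
        dx, dy = x - mx, y - my
        # Quadratic form using the closed-form inverse of a 2x2 matrix.
        quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        return norm * math.exp(-0.5 * quad)

    return density

def prior(trials, gamma):
    """Average of the per-trial kernels: P(x) = (1/N) * sum_i G_i(x)."""
    kernels = [kernel(reps, gamma) for reps in trials]
    return lambda x, y: sum(k(x, y) for k in kernels) / len(kernels)

# Hypothetical final-iteration repetitions for two trials.
trials = [
    [(0.30, 0.30), (0.32, 0.29), (0.29, 0.31), (0.31, 0.32)],
    [(0.60, 0.50), (0.61, 0.52), (0.59, 0.49), (0.62, 0.50)],
]
P = prior(trials, gamma=1e-3)
```

Adding gamma to the diagonal keeps the covariance invertible even when the few repetitions of a trial happen to be nearly collinear.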

Normalizing density compared with uniform distribution

As described above, the random seeds were constrained to have the smallest interval exceed 15% of the pattern duration (300 ms), corresponding to a smaller triangular region within the full rhythm triangle. We defined the uniform distribution over this smaller region as U. To avoid working with small numbers, we pointwise-normalized the kernel density estimate P with respect to U—namely, P′(x) = P(x)/U(x). We note that in all images depicting kernel density estimates, the density was clipped at a value of 5 (relative to uniform) to preserve the dynamic range for details at low density values. In the OSF repository associated with this project, we included images with less clipping (relative density of 10).
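As a small worked sketch of this normalization, assume that the triangle is embedded with unit side length (as the vertex coordinates given for the inner region imply) and that densities are taken with respect to area in these coordinates. The inner triangle then has side length 1 − 3f, and U is the reciprocal of its area. The function names below are hypothetical.

```python
import math

F = 300 / 2000  # minimum interval as a fraction of the pattern duration

def inner_triangle_area(f=F):
    """Area of the inner equilateral triangle, whose side length is 1 - 3f."""
    side = 1 - 3 * f
    return math.sqrt(3) / 4 * side ** 2

def relative_density(p, f=F, clip=None):
    """P'(x) = P(x) / U(x), with U = 1 / area inside the inner triangle."""
    u = 1.0 / inner_triangle_area(f)
    value = p / u
    # Optional clipping, as used when rendering the density images.
    return min(value, clip) if clip is not None else value
```

With this convention a relative density of 1 means that a rhythm is exactly as probable as under uniform sampling of the seeds, and values above the clipping level are displayed at that level.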

Jensen–Shannon divergence

To compare distances between distributions, we used the Jensen–Shannon divergence. The Jensen–Shannon divergence of two distributions P and Q is defined as

$${\mathrm{JSD}}(P,Q)=\frac{1}{2}{D}_{{\mathrm{KL}}}\left(P,M\,\right)+\frac{1}{2}{D}_{{\mathrm{KL}}}\left(Q,M\,\right)$$

where \(M=\frac{1}{2}\left(P+Q\right)\) and \({D}_{{\mathrm{KL}}}\left(P,Q\right)=\displaystyle\int_{x} P(x)\ \log_2\left(\frac{P(x)}{Q(x)}\right)\ {\mathrm{d}} \ \!x\).
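For binned distributions, the integral becomes a sum over bins. The following is a minimal sketch under that assumption (function names are illustrative); bins where the first argument is zero contribute nothing to the Kullback–Leibler term, following the usual convention.

```python
import math

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence in bits; zero-mass bins of p are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two binned distributions p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

With base-2 logarithms the divergence is bounded between 0 (identical distributions) and 1 (distributions with disjoint support).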

Gaussian mixture model fits

To measure the relative weight of each category in a group’s prior, we used a Gaussian mixture model in which the mean of each mixture component was constrained to be close to a small-integer-ratio rhythm. This constraint aided interpretability by removing the degeneracy in the correspondence between mixture components and the modes of the data distribution, guaranteeing that each mode was associated with the same category across groups, while allowing the mixture components to deviate from exact integer ratios as dictated by the data. We imposed additional constraints on the standard Gaussian mixture model fitting procedure to ensure that the mapping of mixture components to integer ratios was fixed across groups, to avoid artefacts associated with small sample size and to avoid uninterpretable overlap between the modes.

We define a Gaussian mixture model with category centres {μi}i=1…K, covariance matrices {Ci}i=1…K and weights {wi}i=1…K as follows:

$$Q(x)=\sum _{i=1}^{K}\frac{{w}_{i}}{2\pi \sqrt{|{C}_{i}|}}\exp \left(-\frac{1}{2}{\left(x-{\mu }_{i}\right)}^{{\mathrm{T}}}{C}_{i}^{-1}(x-{\mu }_{i})\right)$$

To fit the model, we used a modified expectation–maximization algorithm120. We initialized the algorithm by assigning the mixture components to the small-integer-ratio rhythms within Ω22. We then proceeded by alternating between the expectation and maximization steps. After each maximization step, we applied the following additional constraints:

  1. Mode identity: to guarantee that each mode was associated with the same category across groups, we required that \({\|{\mu }_{i}-{\mu }_{i}^{0}\|}_{2} < \frac{1}{2}{d}_{\min }\), where \({\mu }_{i}^{0}\) is the ith category in Ω22 and where dmin is the minimal distance between categories in Ω22. This constraint permits the modes to deviate substantially from integer ratios to faithfully represent bias in the data, but not so much that the correspondence with integer ratios is lost.

  2. Overlap: to avoid overlap between the modes, we required that the eigenvalues of the covariance matrix λ1 and λ2 satisfy the constraint that \(\left|{\lambda }_{i}\right| < \frac{1}{2}{d}_{\min }\).

  3. Additional constraints on the overlap between the modes: we also required that λ1 and λ2 be limited by \(A < \frac{{\lambda }_{i}}{{\lambda }_{1}+{\lambda }_{2}} < 1-A\), where A is a constant. We fixed A = 1/5, which intuitively corresponds to a constraint on the aspect ratio of the ellipsoid defined by the covariance matrix.

After each maximization step, we applied constraint 1 by projecting the μi resulting from the maximization step onto the closest point (in Euclidean distance) that satisfied the constraint. Similarly, we applied constraints 2 and 3 to the eigenvalues of the covariance matrix by truncating them so that they satisfied both constraints:

$${\lambda }_{i}^{\prime}=\min \left({\lambda }_{i},\frac{1}{2}{d}_{\min }\right)$$

$${\lambda^{\prime\prime}}_{i}=\min{\left(\max\left({\lambda^{\prime}}_{i},A({\lambda^{\prime}}_{1}+{\lambda^{\prime}}_{2})\right),(1-A)({\lambda^{\prime}}_{1}+{\lambda^{\prime}}_{2})\right)}$$

We then iterated these steps until convergence (using a convergence threshold of \(1\times {10}^{-6}\)). The result was an estimate of {μi}i=1…K, {Ci}i=1…K and {wi}i=1…K. We emphasize that these constraints do not place strong limits on the locations of the modes, which are free to exhibit biases, or on the category boundaries, which need not be located symmetrically between modes. The constraints merely serve to enable us to label the modes in a consistent way. A Gaussian mixture model fit in this way to each group’s data explained most of the kernel density variability (91% on average, ranging from 80.6% to 97.5% depending on the group).
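The post-maximization projection and truncation can be sketched as two small helpers. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the mean is projected onto the closed disc of radius dmin/2 (the strict inequality in the text would need an arbitrarily small margin), and the eigenvalue bounds follow the two truncation equations given above.

```python
import math

def project_mean(mu, mu0, d_min):
    """Pull mu back onto the disc of radius d_min / 2 around mu0 if it has left it."""
    dx, dy = mu[0] - mu0[0], mu[1] - mu0[1]
    dist = math.hypot(dx, dy)
    radius = 0.5 * d_min
    if dist <= radius:
        return mu
    scale = radius / dist  # shrink the offset so the point lands on the boundary
    return (mu0[0] + dx * scale, mu0[1] + dy * scale)

def truncate_eigenvalues(lam1, lam2, d_min, a=0.2):
    """Apply the two truncation steps to the covariance eigenvalues."""
    # Step 1: cap each eigenvalue at d_min / 2.
    l1 = min(lam1, 0.5 * d_min)
    l2 = min(lam2, 0.5 * d_min)
    # Step 2: keep each eigenvalue's share of the capped sum within [a, 1 - a].
    s = l1 + l2
    return tuple(min(max(l, a * s), (1 - a) * s) for l in (l1, l2))
```

In a full fit, the truncated eigenvalues would be recombined with the unchanged eigenvectors to rebuild each covariance matrix before the next expectation step.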

This procedure is different from the one reported in Jacoby and McDermott26, where we performed a numeric constraint optimization with the MATLAB fmincon function on the Kullback–Leibler divergence of the kernel density estimate defined by the model and the kernel density estimate of the data. Other than this difference, we had similar constraints on the optimization. The 2017 procedure provided comparable results but was slower than the method used here. Considering the large amount of data in this project, we considered the efficient expectation–maximization method to be preferable to direct optimization.

Gaussian mixture model with 7:2:3 category

In the case of Malian drummers and dancers, we added rhythm categories at 2:3:7, 7:2:3 and 3:7:2 (Figs. 5 and 7g–i). We denote the extended set by Ω25 = Ω22 ∪ {2:3:7, 7:2:3, 3:7:2} and used it instead of Ω22 for the Gaussian mixture model estimate. In all other respects, the analysis was identical to that in the previous section (Gaussian mixture model fit with Ω22).

Average category weights

In all cases, we fitted the Gaussian mixture to the categories (Ω22, except for Fig. 7g–i, where we used Ω25). In some cases, we wanted to display or analyse the results ignoring cyclic permutations of the same category (for example, 2:2:3, 2:3:2 and 3:2:2 would be mapped to the same category 223). We then computed the Gaussian mixture model fit to Ω22 and averaged the weight across the three permutations. The one exception was the isochronous rhythm 1:1:1, which has no variants; in this case we used the original fit of 1:1:1 in Ω22. This resulted in eight weights per group corresponding to the eight categories in Ω8 (defined above in ‘Small-integer ratios’). These category weights are reported in Extended Data Fig. 4 and are provided as part of the OSF repository associated with this publication.
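The averaging over cyclic permutations can be sketched as follows, assuming the fitted weights are stored in a dictionary keyed by ratio tuples (a hypothetical layout). Each ratio is mapped to the lexicographically smallest of its rotations, so 2:2:3, 2:3:2 and 3:2:2 share a key while 1:2:3 and 1:3:2 remain distinct.

```python
def canonical(ratio):
    """Representative of a ratio up to cyclic rotation (smallest rotation)."""
    rotations = [ratio[i:] + ratio[:i] for i in range(len(ratio))]
    return min(rotations)

def average_over_rotations(weights):
    """Average mixture weights across the cyclic permutations of each category."""
    groups = {}
    for ratio, w in weights.items():
        groups.setdefault(canonical(ratio), []).append(w)
    return {key: sum(ws) / len(ws) for key, ws in groups.items()}
```

The isochronous rhythm has a single rotation, so its weight passes through unchanged.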

Significant distance between two groups

In this analysis, we evaluated whether two distributions P1 and P2 associated with two groups had significantly different kernel density estimates. Since the Jensen–Shannon divergence is non-negative, it deviates from zero whenever the kernel density estimates being compared are computed from finite samples, even if the underlying distributions are identical. We used bootstrapping to estimate whether the distance between P1 and P2 was greater than what is expected from this finite-sampling effect. We created 1,000 simulated split halves of the trials from each participant group. From these bootstrap samples, we estimated the 1,000 kernel density estimates associated with the two splits of each group (we denote them by \({P}_{1}^{\;j,1}\) and \({P}_{1}^{\;j,2}\), where j indexes the 1,000 split halves, and \({P}_{2}^{\;j,1}\) and \({P}_{2}^{\;j,2}\)). We then computed \({\rm{JSD}}\left({P}_{1}^{\;j,1},{P}_{2}^{\;j,1}\right)\) (the distance between split halves across groups) and compared it with \({\rm{JSD}}\left({P}_{1}^{\;j,1},{P}_{1}^{\;j,2}\right)\) and \({\rm{JSD}}\left({P}_{2}^{\;j,1},{P}_{2}^{\;j,2}\right)\). We assessed statistical significance via a P value from the minimum of the rank order of \({\mathrm{JSD}}\left({P}_{1}^{\;j,1},{P}_{2}^{\;j,1}\right)\) within the two null distributions for \({\mathrm{JSD}}\left({P}_{1}^{\;j,1},{P}_{1}^{\;j,2}\right)\) and \({\mathrm{JSD}}\left({P}_{2}^{\;j,1},{P}_{2}^{\;j,2}\right)\). Namely, to declare that two groups are significantly different, their mean Jensen–Shannon divergence had to be significant with respect to both within-group Jensen–Shannon divergences. We also computed the difference in Jensen–Shannon divergence:

$$\begin{array}{l}D=\left[\left({\rm{JSD}}\left({P}_{1}^{\;j,1},{P}_{2}^{\;j,1}\right)-{\rm{JSD}}\left({P}_{1}^{\;j,1},{P}_{1}^{\;j,2}\right)\right)+\left({\rm{JSD}}\left({P}_{1}^{\;j,1},{P}_{2}^{\;j,1}\right)\right.\right.\\\qquad\left.\left.-\,{\rm{JSD}}\left({P}_{2}^{\;j,1},{P}_{2}^{\;j,2}\right)\right)\right]/2.\end{array}$$

We report the mean of the difference (mean(D)) as well as the 95% CIs of D.
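A rough sketch of the split-half computation of D is given below. It rests on several assumptions that are not taken from the paper: splits are drawn by shuffling without replacement, the density estimator is passed in as a function (a toy 1D histogram stands in for the kernel density estimate), the within-group terms compare the two different halves of the same group, and the 95% interval is read off as simple percentiles of the sorted differences. The rank-based P value is not reproduced here.

```python
import math
import random

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two binned distributions."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def random_halves(trials, rng):
    """Shuffle a copy of the trials and cut it into two halves."""
    shuffled = rng.sample(trials, len(trials))
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def split_half_differences(trials1, trials2, density_fn, n_splits=1000, seed=0):
    """Return mean(D) and a percentile interval for D across random split halves."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_splits):
        a1, b1 = random_halves(trials1, rng)
        a2, b2 = random_halves(trials2, rng)
        across = jsd(density_fn(a1), density_fn(a2))
        within1 = jsd(density_fn(a1), density_fn(b1))
        within2 = jsd(density_fn(a2), density_fn(b2))
        diffs.append(((across - within1) + (across - within2)) / 2)
    diffs.sort()
    mean_d = sum(diffs) / len(diffs)
    lo = diffs[int(0.025 * (len(diffs) - 1))]
    hi = diffs[int(0.975 * (len(diffs) - 1))]
    return mean_d, (lo, hi)

def histogram(values, n_bins=4):
    """Toy 1D density on [0, 1): normalized bin counts (stand-in for the KDE)."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return [c / len(values) for c in counts]
```

A positive mean(D) whose interval excludes zero indicates that the across-group divergence exceeds what finite sampling alone produces within groups.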

Discrete mode (‘peakiness’) analysis

We performed three analyses to substantiate the presence of discrete modes in the measured priors. In analysis 1, to show that the mass of the estimated density was centred in a small part of the space, we computed for each group the 33% of the bins with the highest kernel density and then computed the sum of the density in these bins relative to the sum over all bins. This resulted in numbers ranging between 61.8% and 81.7% (mean 70.1%) over the 39 groups. To obtain a null distribution for this quantity, for each group we sampled points (equal in number to the total number of trials for that participant group) randomly on the triangle and estimated the empirical kernel density estimate for this random distribution. We then repeated the selection process described above, picking the 33% of the bins with the highest kernel density and computing the proportion of the summed density in these bins. This analysis showed that the percentage obtained in this way from the empirical data was significantly larger than would be expected from the null distribution computed from uniform sampling.

In analysis 2, we estimated the peak density with respect to a uniform distribution. We identified the bin in the kernel density estimate with the highest density and found that in all 39 groups this bin had a density that was over five times larger than the density of the same bin under a uniform distribution (range, 5.3–13.1; mean, 8.8). We estimated the statistical significance of this ratio using a null distribution obtained by sampling points from a uniform distribution (using the same number of points per group) and measuring the peak density from the resulting kernel density estimate. We found that the empirical peak ratios were significantly larger than would be expected by chance for all 39 groups (P < 0.001 in each case).

In analysis 3, we fit a Gaussian mixture model with mixture components constrained to be near small-integer ratios (see ‘Gaussian mixture model fits’ for the details). This model explained most of the variance in the kernel density (91% on average, ranging from 80.6% to 97.5% depending on the group). Explained variance was measured here by treating the kernel density estimates of both the empirical data and the models as vectors and squaring the correlation between the vectors.
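Two of the quantities used in these analyses, the share of density held by the densest bins (analysis 1) and the explained variance (analysis 3), can be sketched for binned densities as follows. The function names are hypothetical, and the bin fraction is left as a parameter rather than fixed.

```python
import math

def top_fraction_mass(density, fraction):
    """Share of total density held by the highest-density `fraction` of bins."""
    k = max(1, round(fraction * len(density)))
    top = sorted(density, reverse=True)[:k]
    return sum(top) / sum(density)

def explained_variance(data_density, model_density):
    """Squared Pearson correlation between two binned densities treated as vectors."""
    n = len(data_density)
    mx = sum(data_density) / n
    my = sum(model_density) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(data_density, model_density))
    sxx = sum((x - mx) ** 2 for x in data_density)
    syy = sum((y - my) ** 2 for y in model_density)
    r = sxy / math.sqrt(sxx * syy)
    return r * r
```

Applying `top_fraction_mass` to densities estimated from uniformly sampled points gives the corresponding null distribution.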

Overlap with small-integer ratios

We evaluated overlap with small-integer ratios using three different analyses (Fig. 4a). First, we computed the average minimal distance between each fifth-iteration reproduction (for all participants in a given group) and the closest small-integer-ratio rhythm: L2(Ω22, pi), where pi are the fifth-iteration reproductions represented on the rhythm triangle, Ω22 is the set of small-integer ratios involving the numbers 1–3 defined above and L2 is the minimum of the Euclidean distances between the point pi and each of the 22 points in Ω22. To show that these distances are significantly smaller than would be obtained by chance, we generated a null distribution by randomly sampling sets of 22 points uniformly from the triangle and computing the same mean minimal distance between the points pi and these randomized sets. We then compared the empirical distance to the null distribution. We used this first analysis for the results reported in Fig. 4a on the grounds that it is simple to describe.
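The first analysis can be sketched as below. This is an illustrative reconstruction with hypothetical function names: the triangle is passed in as three vertices, uniform sampling inside it uses the standard reflection trick, and the P value is taken as the fraction of null sets whose mean minimal distance is at least as small as the observed one.

```python
import math
import random

def sample_in_triangle(a, b, c, rng):
    """Uniform sample inside triangle abc (reflect points that fall outside)."""
    u, v = rng.random(), rng.random()
    if u + v > 1.0:
        u, v = 1.0 - u, 1.0 - v
    return (a[0] + u * (b[0] - a[0]) + v * (c[0] - a[0]),
            a[1] + u * (b[1] - a[1]) + v * (c[1] - a[1]))

def mean_min_distance(points, anchors):
    """Mean over points of the Euclidean distance to the nearest anchor."""
    return sum(min(math.dist(p, q) for q in anchors) for p in points) / len(points)

def null_distances(points, triangle, n_anchors=22, n_null=1000, seed=0):
    """Mean minimal distances to randomly placed anchor sets of the same size."""
    rng = random.Random(seed)
    a, b, c = triangle
    return [
        mean_min_distance(
            points, [sample_in_triangle(a, b, c, rng) for _ in range(n_anchors)])
        for _ in range(n_null)
    ]

def p_value(observed, null):
    """Fraction of null values that are at least as small as the observed value."""
    return sum(1 for v in null if v <= observed) / len(null)
```

Here `points` would hold the fifth-iteration reproductions and the observed value would be `mean_min_distance(points, omega22)` for the 22 small-integer-ratio points.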

Second, we performed an additional control analysis where instead of sampling the sets of 22 points uniformly, we constrained them so that each point fell within a circle of radius d around the integer points, where d is 1/2 of the minimal distance between two points in Ω22. This guaranteed that the null sets were spaced similarly to Ω22. The results of this alternative analysis were similar to those of the simpler analysis described above, and in all 39 cases the empirical distance was significantly smaller (P < 0.001) than would be expected from the null distribution, even when Bonferroni correction was applied.

Third, we applied the integerness score reported in Jacoby and McDermott26. In this analysis, we computed the Jensen–Shannon divergence between the empirical kernel density estimate of the fifth iteration (P) and the normalized indicating function \({I}_{{\varOmega }_{22}}(x)\) \(=\frac{1}{22}\sum _{\omega \in {\varOmega }_{22}}\delta \left(x-\omega \right)\), where δ is the Dirac delta function on the triangle. This Jensen–Shannon divergence is minimal if all probability mass is located at the small-integer ratios. We initially fitted an unconstrained Gaussian mixture model with 22 components. We then randomized the means of the components of this mixture. This simulates a response distribution that is similar in statistical characteristics to the data, but that is not centred around integer ratios. We obtained a null distribution by generating 1,000 such randomized distributions, each time computing the Jensen–Shannon divergence with \({I}_{{\varOmega }_{22}}(x)\). We then compared the distance of \({I}_{{\varOmega }_{22}}(x)\) and P to this null distribution. We found that in all cases the distance between \({I}_{{\varOmega }_{22}}(x)\) and P was significantly smaller (P < 0.01) than would be expected by chance.

Bias analysis

Figure 4b displays an analysis testing whether perceptual category centres are systematically biased away from the corresponding small-integer ratio. The small grey dots plot the component means of the fitted Gaussian mixture models for each category and participant group. We then calculated the empirical means (indicated with larger black dots) of each category across all groups. We applied a non-parametric test analogous to an analysis of variance to test whether each category was biased. The test statistic was the ratio of (1) the average squared Mahalanobis distance between all points and the empirical mean and (2) the average squared Mahalanobis distance between all points and the corresponding small-integer ratio. We compared this test statistic to its null distribution computed from 10,000 bootstrapped samples where the data were randomly sampled from a Gaussian distribution with the same empirical covariance matrix as the experimental data but with the mean set to the integer ratio category (that is, with zero bias).

Nine categories showed small but significant deviations from unbiased integer ratio categories after Bonferroni correction (P < 0.0012 for all cases; see the blue significance symbols in Fig. 4b: ***P < 0.001; **P < 0.01). The biased categories consisted of the three cyclic permutations of each of 1:2:3, 2:1:3 and 2:2:3 (123/231/312, 213/321/132 and 223/232/322). The bias of the ‘6/8’ categories 1:2:3 and 2:1:3 is consistent with lengthening of short elements in rhythm performance studied in European musicians42. It is also evident in the experimental literature on rhythm perception in Western European and North American listeners. Fraisse121 showed that non-musicians have a small bias when judging three-interval rhythms, tending to judge the two shorter intervals as being closer to equal. Repp et al.122 argue that this bias originates from categories near 1:2:3 and 2:1:3 that are slightly shifted away from those integer-ratio rhythms, in a direction that lengthens the shortest interval (a phenomenon we also observed cross-culturally). The results are also consistent with the observation that the shortest interval of a two-interval rhythm is heard as elongated, making the rhythm more similar to isochrony123. It may also be related to the phenomenon of non-isochronous beat subdivision in African and African-American music genres (for example, jembe music from Mali and ‘swing’ jazz from the United States), in which the short interval in short-long rhythms is often markedly elongated relative to a 1:2 ratio124.

Multidimensional scaling analysis

To visualize the similarity relations between the rhythm priors for each participant group, we first estimated the priors as the kernel density estimate Pi from the fifth iteration of the experiment (aggregated for all participants in each group; see above; Fig. 5a). We then computed the Jensen–Shannon divergence between all pairs of groups Mij = JSD (Pi, Pj). We used MATLAB’s mdscale function with the default parameters to obtain a two-dimensional space in which the rhythm prior for each group was positioned so as to best match the measured distances. Note that we used the distances between the full distributions (that is, the kernel density estimates) for the multidimensional scaling analysis (as opposed to the Gaussian mixture models used in other analyses).

Category weight for the 3:3:2 rhythm

To compute the weight of the 3:3:2 rhythm for each group, we computed the Gaussian mixture model weights as explained above for the 22 rhythm categories and then averaged the weights over the three cyclic permutations of 3:3:2 (3:3:2, 3:2:3 and 2:3:3; Fig. 5b). We obtained error bars via bootstrapping, sampling 1,000 datasets with replacement for each group and computing the weights of the Gaussian mixture model for each of these datasets. The error bars plot one standard deviation of the resulting bootstrap distribution above and below the mean (that is, the bootstrap estimate of the standard error). The order of the groups in the bar graph of Fig. 5b is drawn from the first dimension of the multidimensional scaling analysis, and it is apparent that this dimension is correlated with the 3:3:2 category weight (which increases nearly monotonically across the first multidimensional scaling dimension).

Analysis of student and online groups

In the first analysis, we computed the average distance (Jensen–Shannon divergence) between the estimated priors of all pairs of student or online groups in different countries and compared it to that of pairs of non-musician and local musician groups (from the same countries as the student/online groups; Fig. 6a,d). The pairs we considered were within the following sets of groups:

  • Figure 6a: for students, US (Boston)-ST, Bolivia (La Paz)-ST, Uruguay-ST, UK-ST, Turkey-ST, Mali-ST, S. Korea-ST and Japan-ST; for non-students, US (NYC)-NM, US (NYC)-LM, Bolivia (La Paz)-NM, Bolivia (San Borja)-NM, Bolivia (Santa Cruz)-NM, Bolivia (Tsimane)-NM, Uruguay-LM, UK-LM, Turkey-LM, Mali-LM, S. Korea-LM and Japan-LM.

  • Figure 6d: for online groups, US-OL, Brazil-OL and India-OL; for non-online groups, US (NYC)-NM, US (NYC)-LM, Brazil-LM, India-NM and India-LM.

To evaluate the statistical significance of the difference in distances, we created shuffled datasets where two sets of groups (one the same size as the student/online set and one the same size as the non-student/non-online set) were sampled without replacement from the union of the student/online and non-student/non-online sets. We then computed the difference between the average Jensen–Shannon divergences of these shuffled groups for each resampling and evaluated the probability of the actual difference under this null distribution.
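The shuffling procedure can be sketched as follows. The names and data layout are assumptions for illustration: pairwise Jensen–Shannon divergences are stored in a dictionary keyed by unordered group pairs, and the one-sided P value counts shuffles whose difference is at least as negative as the observed one (that is, the first set is at least as mutually similar).

```python
import random
from itertools import combinations

def mean_pairwise(groups, dist):
    """Average distance over all unordered pairs within a set of groups."""
    pairs = list(combinations(groups, 2))
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

def shuffle_test(set_a, set_b, dist, n_shuffles=10000, seed=0):
    """Observed difference in mean pairwise distance and its shuffle-based P value."""
    rng = random.Random(seed)
    observed = mean_pairwise(set_a, dist) - mean_pairwise(set_b, dist)
    pool = list(set_a) + list(set_b)
    hits = 0
    for _ in range(n_shuffles):
        # Reassign group labels without replacement, keeping the two set sizes.
        shuffled = rng.sample(pool, len(pool))
        a, b = shuffled[:len(set_a)], shuffled[len(set_a):]
        if mean_pairwise(a, dist) - mean_pairwise(b, dist) <= observed:
            hits += 1
    return observed, hits / n_shuffles
```

For Fig. 6a, `set_a` would be the student groups and `set_b` the non-student groups; both sets need at least two members for the pairwise average to be defined.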

In the second analysis, we computed the average distance (Jensen–Shannon divergence) between the US student group (US(Boston)-ST) and the priors of all other student/online groups (student: Bolivia(La Paz)-ST, Uruguay-ST, UK-ST, Turkey-ST, Mali-ST, S.Korea-ST and Japan-ST; online: US-OL, Brazil-OL and India-OL). We compared this average distance to a null distribution obtained by sampling sets of non-student/non-online groups of the same size (student: seven groups; online: three groups) and measuring the average distance of each set of groups and the US student group.

To control for the fact that student groups tended to be younger than other groups, we repeated the above two analyses restricted to participants younger than 40. The group differences in mean age were not eliminated by this restriction, but they were significantly reduced (with this restriction, all groups had mean ages between 21 and 33.7 years and could all be considered ‘young’). The statistics reported in ‘Students and online participants resemble US participants’ in ‘Results’ use pairwise bootstrapped Jensen–Shannon divergence (see ‘Significant distance between two groups’).

We also performed a control analysis testing for effects of age by comparing the kernel densities of younger and older subsets of the online groups. Given that over a third of participants in the online groups were older than 35 (46%, 34% and 39% in the US, Indian and Brazilian groups, respectively), we divided these groups into younger and older subsets using the age of 35 as a threshold. We found that the kernel density estimates of the younger and older subsets were not significantly different for the US, Indian and Brazilian groups (see ‘Students and online participants resemble US participants’ in ‘Results’). These results suggest that age does not explain the increased similarity of student populations.

Word clouds of favourite music

In the demographic questionnaire, we asked participants to list their three favourite bands or musical artists and to indicate the genre for each (Fig. 6b,c,e,f). The level of detail varied somewhat between individuals and groups (for example, some individuals specified sub-genres such as ‘indie-rock’, whereas others indicated ‘rock’). Due to site-specific limitations on the experiment session duration, this question was asked of only 31 of the 39 groups. Text entries were verified by searching each entry in the Google Knowledge Graph Search API (https://developers.google.com/knowledge-graph). In case of items with incomplete matches, spelling errors were manually corrected. We then analysed the results using single-word histograms. Some of these histograms are presented as word clouds in Fig. 6, with the font size proportional to the frequency of occurrence. In addition, the most common words for each group are presented in Supplementary Table 2. This analysis is qualitative but nonetheless provides concrete evidence for the differences in musical listening habits between student/online and non-student/non-online groups.

Analysis of specific modes

Figure 7 analyses the prominence of particular rhythm modes in different participant groups. For each group, we computed the Gaussian mixture model weights as explained above for the 22 rhythm categories and then averaged the weights over the three cyclic permutations of the rhythm in question. We then examined these weights for participant groups whose local musical tradition was known to feature the rhythm. We asked whether the weights were higher than in the remaining participant groups using a Wilcoxon rank-sum test.

The 2:2:3 and 3:3:2 rhythms had been previously associated with specific musical traditions. 2:2:3 has been documented in Balkan, Turkish and Botswanan music49,50,51, which often employ metres with a signature of 7/8. Balkan and Turkish listeners have also been shown to better discriminate this pattern than US and Canadian participants without such familiarity53,125. The 3:3:2 rhythm is similarly ubiquitous across sub-Saharan Africa54,55 and the African diasporas56,57,58,59 in the Americas. We confirmed its presence in the musical culture of our Malian dancer participants (Mali-DA, recruited among farmers from Sagele village in Southern Mali) by recording and analysing a representative corpus of their musical repertoire. The pieces chosen were ones to which they frequently danced in the context of wedding celebrations and other local events. We found that 46% of the recorded excerpts prominently featured a 3:3:2 pattern, making it one of the most characteristic rhythmic patterns in this repertoire.

The 7:2:3 rhythm evident in the priors measured from drummers in Mali (Fig. 7g–i) is popular in West Africa; a slightly denser, five-interval variant (2:2:3:2:3) constitutes a signature rhythm that is emblematic of the musical culture area54,55. Drumming in Mali is multi-part ensemble music composed of three basic parts: an improvising lead drum, a simple invariant accompaniment and a ‘timeline’ part, whose specific rhythm patterns identify each piece of repertoire126. The ‘Maraka’ is the most frequently performed piece in their repertoire127. One characteristic timeline pattern for the Maraka consists of three accented events that are distributed according to a 7:2:3 pattern across a periodicity composed of 12 metric units (7 + 2 + 3). This pattern is often performed by the timeline player, who alternates during the piece between this pattern and other variants with similar accents. Additionally, we substantiated its presence in Malian music as described in ‘Results’. We used the same procedure to validate participant responses in Bulgaria (2:2:3 rhythm).

Violin plots

To generate violin plots (used in Figs. 7 and 8 and Extended Data Fig. 8), we used Bastian Bechtold’s Violin Plots for MATLAB package (https://github.com/bastibe/Violinplot-Matlab, https://doi.org/10.5281/zenodo.4559847). The open circle plots the median, and the top and bottom of the grey bar plot the 75th and 25th percentiles. The violin plots are kernel density estimates of the data distribution. Whiskers (thin lines) are computed using Tukey’s method128 and reflect the range of non-outlier points.

Tapping precision and asynchrony in musicians and non-musicians

To compare objective precision in our task between musicians and students/non-musicians, we used Wilcoxon tests (one-sided), again Bonferroni-corrected (Fig. 8). For the musician groups, we included both those playing Western music and those playing local musical styles. The measure used to assess tapping precision was the standard deviation of the tapping asynchrony (the time difference between a stimulus click and the corresponding tapped response)3, computed over all valid tapped responses in the main experiment. We also compared the mean of the tapping asynchrony, again computed over all valid tapped responses in the main experiment. The negative mean asynchrony reflects the tendency of taps to occur before the stimulus (in anticipation of the upcoming stimulus).
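The two asynchrony measures can be sketched as below, assuming that each valid tap has already been matched to its stimulus click and that times are in milliseconds. The function name is hypothetical, and the use of the sample (n − 1) standard deviation is an assumption.

```python
import statistics

def asynchrony_summary(tap_times, click_times):
    """Mean and standard deviation of tap-minus-click asynchronies."""
    asynchronies = [tap - click for tap, click in zip(tap_times, click_times)]
    return statistics.mean(asynchronies), statistics.stdev(asynchronies)
```

A negative mean indicates that taps tended to precede the clicks, and a smaller standard deviation indicates more precise tapping.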

Cross sections of priors

To see the structure of the modes of the priors, we used an alternative visualization. Extended Data Fig. 1 displays 1D plots of estimated priors from four groups: three that show elongated modes (BO.TS, IN.OL and UY-ST) and, as a comparison, one group with more symmetric modes (UY-LM). We also show 2D and 3D plots of the priors for comparison. The 3D plots were generated with MATLAB’s surf function.

Cyclic permutations and an analysis of symmetry

Across the groups we tested, the response distributions were typically fairly symmetric across cyclic permutations (Fig. 3 and Extended Data Fig. 3). For example, the modes at 1:1:2, 1:2:1 and 2:1:1 have about the same weight for a given participant group. To quantify this symmetry, we compared the percentage of responses in the final iterations that are in each of the three possible cyclic permutations, which can be identified by whether the longest interval is in the first, second or third position, defined relative to the beginning of the stimulus (Extended Data Fig. 3a).

As is evident in Extended Data Fig. 3b, the deviations from perfect symmetry were relatively modest (perfect symmetry would yield 33.3% of tapped responses in each third of the space; actual proportions ranged from 24.3% to 43.6%; the standard deviation of the difference from 33.3% was 3.2%). However, these deviations appeared to be non-random, with a tendency for more weight on permutations in which the third interval is the longest (red region). Previous literature43,44,45 in fact predicts that the most frequently occurring permutations would be those where the long interval occurs at the end, because if this configuration is played cyclically, the long interval provides a gap that helps the pattern group according to Gestalt principles43. This was the case in 31 of 39 groups (those for which the green area in Extended Data Fig. 3b extends beyond the horizontal line; this number is much greater than would be expected by chance, P < 0.001 via a binomial test; the mean percentage of long-interval-at-the-end patterns was 36%; Cohen’s d = 0.73; 95% CI, (34.7%, 36.5%)). We also found that in 33 of the 39 groups (again much greater than expected by chance, P < 0.001 via a binomial test), the most common position for the first tap within a block was right after the long interval (the mean percentage across groups of tapping after the long interval was 47%; 95% CI, (43.6%, 50.7%); Cohen’s d = 1.2). This suggests that most participants in most groups tend to perceive the onset after the long interval as the ‘beginning’ of the pattern and align their first response to it.
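The binomial tests on group counts can be illustrated with an exact upper-tail probability. The chance level is not stated in the text, so the value of 0.5 used in the usage note below is purely an assumption for illustration, as is the function name.

```python
from math import comb

def binomial_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

For instance, `binomial_tail(31, 39, 0.5)` gives the probability of 31 or more of 39 groups showing an effect if each group were equally likely to show it or not; a lower chance level would make the tail probability smaller still.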

What can explain the overall tendency towards symmetry? In principle, the symmetry could reflect the fact that the beginning of a repeating cycle is ambiguous if participants ignore or forget the initial interval. In an earlier paper26, we tested whether this ambiguity underlies the symmetry evident in the response distributions. Specifically, we performed an experiment in which the first click of the repeating stimulus was given a 7 dB level increment to render the permutations distinct. The results (Supplementary Fig. 5 in that paper) indicated that the symmetry of the response distribution was largely maintained, suggesting that perceptual ambiguity is not the only reason for symmetry in the prior.

Another possibility is that the perception of simple periodic rhythms is influenced by grouping multi-stability, whereby the same stimulus rhythm can be perceived in different sequential arrangements given that starting points are subjective and may be interchangeable to some degree129,130. For instance, a listener might hear the first element of a stimulus as an ‘upbeat’ (anacrusis) preparing the second element to represent the perceptual beginning67,95. Under this interpretation, cyclic permutations of an interval pattern have different beginnings but would otherwise be perceptually similar.

Category weights

Categorization weights were computed for each group for the eight category types (Extended Data Fig. 4). As part of the OSF repository, we have provided the raw data for these category estimates.

Multidimensional scaling and category weights

We computed the correlation between projections of the priors onto the multidimensional scaling dimensions and each of the category weights for each group, averaged across cyclic permutations (Extended Data Fig. 5). The multidimensional scaling projections and Gaussian mixture model weights were computed as described above. The correlation was computed across the 39 groups. The CIs were obtained using the RIN method131.

Principal component analysis

In Fig. 5, we used multidimensional scaling to perform dimensionality reduction, computing the Jensen–Shannon divergences between the kernel density estimates of all pairs of groups. As an alternative, we performed an analogous analysis using principal component analysis (Extended Data Fig. 6). We treated the triangle images generated from the kernel density estimates as large feature vectors, where each pixel is a feature (using kernels with a resolution of 12 ms, that is, 0.006× the pattern duration of 2,000 ms). We computed the principal components of these vectors across 39 groups. The projections of the 39 groups’ priors onto the first two principal components showed a structure very similar to what we obtained with multidimensional scaling (Extended Data Fig. 6a). For example, it is apparent that student groups were again centred in the middle. As with multidimensional scaling, the projection of each group’s estimated prior onto the first principal component was positively correlated with the 2:3:3 category (r37 = 0.94; P < 0.0001; 95% CI, (0.89, 0.97)) and negatively correlated with the simpler categories (1:1:1: r37 = −0.44; P = 0.04; 95% CI, (−0.66, −0.14); 1:1:2: r37 = −0.59; P < 0.001; 95% CI, (−0.76, −0.33); 1:2:2: r37 = −0.70; P < 0.001; 95% CI, (−0.83, −0.49); Extended Data Fig. 6c). The projection onto the second principal component was correlated with 6/8 rhythms (1:3:2: r37 = 0.66; P < 0.001; 95% CI, (0.44, 0.81); 1:2:3: r37 = 0.74; P < 0.001; 95% CI, (0.56, 0.86)) as well as the 1:1:2 rhythm (r37 = −0.63; P < 0.001; 95% CI, (−0.79, −0.4)). The CIs were obtained using the RIN method131. The components can also be visualized (Extended Data Fig. 6b), revealing that their minima and maxima overlap with small-integer-ratio categories. The consistency between the different dimensionality reduction methods indicates the robustness of the results.
The raw data for the category fitting and category weight for each group are also provided in the OSF repository associated with this publication.
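
As an illustration of the principal component analysis described above, the following is a minimal sketch, assuming each group's kernel density image has already been flattened into one row of a matrix. The function name `pca_projections`, the random stand-in data and the use of NumPy's singular value decomposition are our assumptions for illustration and are not taken from the analysis code of the study.

```python
import numpy as np

def pca_projections(images, n_components=2):
    """Project flattened density images onto the leading principal components.

    images: array of shape (n_groups, n_pixels); each row is one group's
    kernel density estimate with every pixel treated as a feature.
    """
    X = np.asarray(images, dtype=float)
    X_centred = X - X.mean(axis=0)  # centre each pixel across groups
    # SVD of the centred matrix: the rows of Vt are the principal components.
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # coordinates of each group
    components = Vt[:n_components]                    # component "images"
    explained = (S ** 2) / np.sum(S ** 2)             # fraction of variance
    return scores, components, explained[:n_components]

# Hypothetical stand-in data: 39 groups with 500 pixels each.
rng = np.random.default_rng(0)
images = rng.random((39, 500))
scores, components, explained = pca_projections(images)
```

Each row of `scores` gives one group's position on the first two components, and each row of `components` can be reshaped back into a triangle image for visualization.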

Category predictions from rhythm priors

In this analysis (Extended Data Fig. 7), we used human psychophysical data previously obtained and published by Desain and Honing9 (the data were available on a website: https://www.mcg.uva.nl/index.html). In their experiment, 29 Western musicians heard one of 66 rhythms (an equally spaced array of points on the rhythm triangle) and used notation software to specify the rhythm that they heard. Because this experiment used Western musical notation, it was possible only with Western musicians. The 29 participants were highly trained professional musicians and advanced conservatory students from Dutch conservatories and from the Kyoto City University of the Arts in Japan. They had received between 7 and 17 years of musical training and were paid for their participation.

When the data were pooled across participants, there were 133 different responses in total. Each response can be expressed in ratio form. For instance, the most common response was 1:1:1, and the second most common response was 1:2:1. The results of the experiment were summarized as a set of regions associated with each musically notated rhythm as the most frequent category choice (Extended Data Fig. 7a, which was our reproduction of Fig. 11 from the Desain and Honing paper using the data we downloaded). To obtain this figure, we followed the two steps below:

1. For each of the 133 responses, we created a kernel density plot representing the interpolated probability of this response at each point on the rhythm triangle. We used a kernel width of 0.03.

2. We found the response with the largest interpolated weight at each point on the triangle.

The resulting figure contains 17 distinct rhythm categories spread over the rhythm triangle.
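
The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: we assume an isotropic Gaussian kernel and an unnormalized kernel-weighted sum of response counts, and the function names, the two-dimensional coordinates and the toy counts are hypothetical. The exact interpolation used to produce the figure may differ.

```python
import math

def gaussian_kernel(p, q, width=0.03):
    """Isotropic Gaussian kernel between two points on the rhythm triangle."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-d2 / (2.0 * width ** 2))

def most_frequent_category(x, stimuli, counts, width=0.03):
    """Return the response label with the largest kernel-smoothed weight at x.

    stimuli: list of stimulus points.
    counts: dict mapping a response label to a list of counts, one per stimulus.
    """
    kernel = [gaussian_kernel(x, s, width) for s in stimuli]
    # Step 1: interpolated weight of every response at point x.
    weights = {
        label: sum(k * c for k, c in zip(kernel, per_stimulus))
        for label, per_stimulus in counts.items()
    }
    # Step 2: keep the response with the largest interpolated weight.
    return max(weights, key=weights.get)

# Hypothetical toy data: two stimuli and two notated responses.
stimuli = [(0.0, 0.0), (0.1, 0.0)]
counts = {"1:1:1": [5, 1], "1:2:1": [0, 4]}
```

Evaluating `most_frequent_category` on a grid of points covering the triangle would yield a map of regions analogous to the one described here.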

We generated analogous regions for the model’s categorization judgements using each group’s prior (Extended Data Fig. 7b). We used the Gaussian mixture model that we previously fitted to the tapping data (see ‘Gaussian mixture model fits’). This model is defined by three sets of parameters: category centres \(\{\mu_i\}_{i=1\ldots 22}\), covariance matrices \(\{C_i\}_{i=1\ldots 22}\) and weights \(\{w_i\}_{i=1\ldots 22}\), which approximate the priors from the tapping data:

$$Q(x)=\sum_{i=1}^{22}\frac{w_i}{2\pi \sqrt{|C_i|}}\exp \left(-\frac{1}{2}(x-\mu_i)^{\mathrm{T}}C_i^{-1}(x-\mu_i)\right)$$

The model selected the category whose corresponding mixture component had the highest value at each point in the triangle. But on the basis of empirical findings that human categorical judgements are best predicted by a nonlinear transform of the underlying probability distribution132,133, we used mixture weights that depended exponentially on the prior weights:

$$U_i(x)=\frac{w_i^{\gamma}}{2\pi \sqrt{|C_i|}}\exp \left(-\frac{1}{2}(x-\mu_i)^{\mathrm{T}}C_i^{-1}(x-\mu_i)\right)$$

where γ > 1 is a parameter that prioritizes high-probability categories. We selected the value of γ as that which maximized the match between the human category judgements and those predicted by the prior estimated from the New York Western musician group (US.NY-WM), yielding γ = 7. We then omitted the US.NY-WM group from the subsequent analysis to avoid non-independence. Additionally, we found empirically that the category 1:1:1 was overrepresented in the human categorization judgements relative to those of the model. We note that this category is unique in that all three cyclic permutations correspond to the same point on the rhythm triangle, which might cause participants to choose it more than other categories. To accommodate this effect, we increased the weight \(w_i\) on the 1:1:1 category by a factor of three. However, the cross-cultural differences shown in Extended Data Fig. 7 were not dependent on this choice (we observed significant differences in the match of Western groups compared with the two non-Western groups in both cases; the matches were simply worse overall without the overweighting).
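
The effect of the exponent γ on the predicted category can be illustrated with a minimal sketch. We assume two-dimensional coordinates on the rhythm triangle and a toy mixture with two components; the function names and all numerical values are hypothetical, and the threefold overweighting of the 1:1:1 category is not included.

```python
import math

def gaussian_2d(x, mu, cov):
    """Bivariate Gaussian density with a 2x2 covariance ((a, b), (c, d))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T C^{-1} (x - mu), using the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def predict_category(x, centres, covs, weights, gamma=7.0):
    """Index of the component with the largest U_i(x) = w_i^gamma * N(x; mu_i, C_i)."""
    scores = [
        (w ** gamma) * gaussian_2d(x, mu, cov)
        for mu, cov, w in zip(centres, covs, weights)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical toy mixture: a heavy component at (0, 0) and a light one at (1, 0).
centres = [(0.0, 0.0), (1.0, 0.0)]
covs = [((0.1, 0.0), (0.0, 0.1)), ((0.1, 0.0), (0.0, 0.1))]
weights = [0.8, 0.2]
x = (0.7, 0.0)  # closer to the light component
```

In this toy case, `predict_category(x, centres, covs, weights, gamma=1.0)` selects the nearer, low-weight component, whereas `gamma=7.0` selects the high-weight component, which is the sense in which γ prioritizes high-probability categories.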

We quantitatively assessed the match between the categories predicted by a group’s prior and those measured in Western musicians as the average distance between the two predicted categories. Specifically, for every sampled point on the rhythm triangle, we compared the model prediction with the top category selected by participants in the Desain and Honing experiment, and we measured the L2 distance in the three-dimensional space of the three intervals in the rhythm, expressed as ratios (proportions of the total pattern duration). For instance, if the participants selected the category [0.5, 0.25, 0.25] and the model predicted [0.5, 0.2, 0.3], then the distance was ||(0, −0.05, 0.05)||. We then averaged this distance for each rhythm in the experiment (that is, every sampled point on the rhythm triangle) to yield an overall measure of the match between the model’s predictions and Western musician category judgements (plotted in Extended Data Fig. 7c). To compare the accuracy of this match across sets of groups, we used a Wilcoxon rank-sum test. We used three sets of groups, defined as follows: Western participants (US.BO-ST, US.BO-WM, US.NY-NM, US.NY-WM, BO.LP-ST, UY-ST, UK-ST, TK-ST, MA-ST, KR-ST, KR-WM, JP-ST, JP-WM and US-OL), non-Western non-musicians (BO.LP-NM, BO.SB-NM, BO.SC-NM, BO.TS-NM, NA-NM, IN-NM, CN-NM, BR-OL and IN-OL) and non-Western musicians and dancers (BR-LM, UY-LM, SE-LM, BG-LM, BG-DA, TK-LM, MA-LM, MA-DA, BW.SA-LM, BW.EA-LM, NA-LM, IN-LM, KR-LM and JP-LM). We excluded jazz musicians in the United States and the United Kingdom as there was not a natural hypothesis regarding how their priors would predict the categories of Western classical musicians.
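
The distance measure in the worked example above can be sketched as follows. This is a minimal illustration; the helper names are hypothetical, and the rank-sum comparison across sets of groups is not shown.

```python
import math

def ratio_to_proportions(ratio):
    """Convert a ratio label such as '2:1:1' to proportions of the pattern duration."""
    parts = [float(p) for p in ratio.split(":")]
    total = sum(parts)
    return [p / total for p in parts]

def category_distance(a, b):
    """L2 distance between two rhythms expressed as interval proportions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_match(human, model):
    """Average distance over sampled points; lower values mean a closer match."""
    distances = [category_distance(h, m) for h, m in zip(human, model)]
    return sum(distances) / len(distances)

# The worked example from the text: participants chose [0.5, 0.25, 0.25],
# the model predicted [0.5, 0.2, 0.3].
human = ratio_to_proportions("2:1:1")
model = [0.5, 0.2, 0.3]
d = category_distance(human, model)  # ||(0, -0.05, 0.05)||, about 0.0707
```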

Validation of musicianship

To compare self-reported years of musical experience between musician and non-musician groups (Extended Data Fig. 8), we used Wilcoxon tests (one-sided), applying Bonferroni correction for multiple comparisons.

Influence of language and musicianship

In this analysis, we evaluated whether two groups that spoke the same or different languages (or that differed in musicianship) had significantly different kernel density estimates (Extended Data Fig. 9). We used the procedure described in ‘Significant distance between two groups’ (bootstrapped Jensen–Shannon divergence).
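
For reference, the Jensen–Shannon divergence between two discretized densities can be sketched as below. This is a minimal illustration and does not reproduce the bootstrap procedure referred to above; the choice of base-2 logarithms and the function name are our assumptions.

```python
import math

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two non-negative vectors.

    Both inputs are normalized to sum to one before the comparison.
    """
    p_total, q_total = sum(p), sum(q)
    p = [v / p_total for v in p]
    q = [v / q_total for v in q]
    m = [(a + b) / 2.0 for a, b in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence; terms with zero mass contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logarithms the divergence ranges from 0 for identical densities to 1 for densities with no overlap.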

Transmission error

Transmission error is the magnitude of the difference between the stimulus and response seeds in each iteration. It is used in the serial reproduction literature to monitor convergence dynamics134,135. As an error measure, we computed the average across trials of \(e=\sqrt{(s_1-r_1)^2+(s_2-r_2)^2+(s_3-r_3)^2}\), where (s1, s2, s3) and (r1, r2, r3) are the stimulus intervals and average response intervals of each iteration, respectively (that is, the response is averaged across the ten repetitions within each iteration). In our previous work, we showed that convergence occurred after about five iterations for both Tsimane’ and US participants. Here we show similar dynamics for all groups (Supplementary Fig. 2).
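
The error measure can be computed as in the following minimal sketch. The function names and the interval values (in ms) are hypothetical and serve only to illustrate the formula.

```python
import math

def transmission_error(stimulus, response):
    """Euclidean distance between stimulus (s1, s2, s3) and averaged response (r1, r2, r3)."""
    return math.sqrt(sum((s - r) ** 2 for s, r in zip(stimulus, response)))

def mean_transmission_error(pairs):
    """Average the error over trials for a single iteration."""
    errors = [transmission_error(s, r) for s, r in pairs]
    return sum(errors) / len(errors)

# Hypothetical (stimulus, response) pairs for two trials at one iteration.
pairs = [
    ((500.0, 700.0, 800.0), (520.0, 690.0, 790.0)),
    ((400.0, 600.0, 1000.0), (400.0, 600.0, 1000.0)),
]
mean_error = mean_transmission_error(pairs)
```

Computing `mean_transmission_error` separately for each of the five iterations gives a curve whose flattening indicates convergence.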

High-resolution prior visualizations

In Supplementary Figs. 3–9, we provide higher-resolution images for the measured priors presented in Fig. 3b and Extended Data Fig. 2.

Fast-tempo experiment

Procedure

When experimental conditions allowed for longer sessions, we ran an additional experiment to explore whether the results would be similar at other tempos. The experiment was always run last, and was identical to the main experiment except that the pattern duration was 1,000 ms. The other experimental constants (for example, the fastest allowed interval) were scaled accordingly (the experiment was identical to the fast-tempo experiment in Jacoby and McDermott26, experiment S2, shown in Supplementary Fig. 3 of that paper).

Participants

A total of 293 participants from 13 groups (6 countries) participated in the fast-tempo experiment. These participants performed 7,587 trials (seeds) with 911,564 taps. The demographic information for these participants is summarized in Supplementary Table 1.

Analysis of results

The kernel density estimates of the 13 groups are provided in Extended Data Fig. 2. We provide the raw data of the experiment in the OSF repository associated with this publication.

Overall, the results at the faster tempo were similar to those at the slower tempo. All 13 groups who performed the fast-tempo experiment produced priors that were closer to integer ratios than would be obtained by chance (P < 0.001 in all cases), even with Bonferroni correction.

Supplementary Fig. 1 shows an analysis of the weights of the modes in the 13 groups. The weights of the 22 categories were correlated across the two tempos (r = 0.35–0.72 for each of the 13 groups; P = 0.0001–0.05; mean r = 0.57; s.d. = 0.1). As expected from previous literature, there were also some subtle differences between the category weights for the two tempos (Supplementary Fig. 1). Three of the four largest effect sizes were found in dancers (Bulgarian dancers: effect size of 5.5—more weight on category 2:2:3 in the fast tempo; Malian dancers: effect size of 5.6—more weight on 1:2:3 in the fast tempo; Malian dancers: effect size of 4.9—decreased weight on 3:3:2 in the fast tempo). These tempo-dependent effects in dancers are consistent with the idea that dancers have an increased sensitivity to tempo and to embodied aspects of music136,137. For instance, Bulgarian dancers showed much more weight on 2:2:3 at the faster tempo. Bulgarian folklorists have long recognized tempo as an important factor in metrical patterns that feature a 2:2:3 ratio, such that the metric durations are considered fundamentally unequal only when performed at fast tempos138,139. This idea was also reflected in one of the interviews we conducted with the musicians after the experiment. When we asked one participant whether she recognized the 2:2:3 pattern with a period of 2,000 ms, she identified it as the rhythm of a Bulgarian dance type called rŭchenitsa, but slower than usual. The effect of tempo in Malian dancers’ 1:2:3 category is similarly consistent with the findings of Polak et al.27, who showed that reproductions of short–long patterns in two-interval rhythms (the first part of the 1:2:3 pattern) strongly vary with tempo. This pattern is characteristic of the three most common Malian jembe musical pieces: Maraka, Suku and Manjanin, which are typically performed at a very fast tempo (100–200 beats per minute)126.

As the tempo increases, one might expect to see effects related to whether the temporal intervals in a rhythm are readily producible by humans93. We did not see clear evidence for this at the 1,000 ms tempo (specifically, the rhythms with the shortest intervals, 1:2:3 and 1:3:2, did not have significantly lower category weights for 1,000 ms than for 2,000 ms, as evaluated with a binomial test), though there were some trends in this direction. It seems likely that for sufficiently fast tempos, and with enough data, such effects would be detectable.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


