Hacking Alexa's Voice Recordings

Michael Haephrati

Nov 7, 2019

CPOL

Now you can download and keep your own voice recordings, which Amazon stores but does not make available to customers

Introduction

I have two Amazon Echo devices (one was given to my daughter, who goes to college in another country). Amazon has recently confirmed that the voice recordings produced by customers of the Amazon Alexa smart assistant are held forever unless users manually remove them. After looking into this, I tried to find a way to download my data. Filing a formal request with Amazon led to an email "approving" my request; however, none of my recordings were included in the data. After inquiring with customer service, I was told that customers can only listen to or delete their recordings, and that there is no option to download them. In other words, if you use an Amazon Alexa device, Amazon holds all your recording files but you can't get them. Well, now you can, with the Python script we developed that does exactly that.

Background

If you have an Alexa device, you can just go to https://alexa.amazon.com/ and then press Settings, then History.

You will then be able to view every interaction you have had with Alexa. That includes unsuccessful ones, such as when Alexa didn't understand you, or when she recorded a private conversation not meant for her (which happens from time to time). These entries can be expanded, and in most cases they contain a small "Play" icon that you can press to hear the recording. There is no option to download these recordings. You can, however, delete them.

I wasn't interested in deleting them, because I think it's quite cute to be able to listen to all kinds of conversations that took place while Alexa was listening and documenting (almost) everything. What bothered me was my inability to download these recordings. The Python script we developed does that job, and gives each recording a logical file name based on the date, the time and the conversation title.

Here is what it looks like.

Preparing the Python Environment

Preparing for the rest of the process requires installing Python and several libraries.

Downloading and Installing Python

Use the following link to download.
After installing it, add the installation location to the PATH environment variable.
The default location will be:
- C:\Users\<Your user name>\AppData\Local\Programs\Python\Python37-32
You can add this entry to PATH using the following command:
```
setx path "%path%;c:\Users\<YOUR USER NAME>\AppData\Local\Programs\Python\Python37-32"
```

Open Command Line (CMD) and enter the following line:

python -m pip install selenium pygithub requests webdriver_manager

You may get a warning that the Python Scripts directory is not on PATH. To ensure our script runs smoothly, add the following entry:

setx path "%path%;c:\Users\<YOUR USER NAME>\AppData\Local\Programs\Python\Python37-32\Scripts"

The pip command above installs the following libraries:

Selenium - used for browser automation
PyGithub - used for interfacing with the GitHub API
Requests - used for HTTP communication
webdriver_manager - Python Webdriver Manager. Downloads and manages the browser driver binaries (such as chromedriver) that Selenium needs.

The credentials.py File

We use a separate file where you can enter your Amazon credentials so the script can automatically log in to your account.

class Credentials:
    email = '*****'
    password = '******'

Running the Script

Type:

python alexa.py

How It Works

The script goes as follows:

Logging in to Alexa

The following function is used to log in to your Alexa history through your Amazon account:

def amazon_login(driver, date_from, date_to):
    driver.implicitly_wait(5)
    logger.info("GET https://alexa.amazon.com/spa/index.html")
    # get main page
    driver.get('https://alexa.amazon.com/spa/index.html')
    sleep(4)
    url = driver.current_url
    # if amazon asks for signin, it will redirect to a page with signin in url
    if 'signin' in url:
        logger.info("Got login page: logging in...")
        # find email field
        # WebDriverWait waits until elements appear on the page,
        # so it prevents the script from failing while the page is still loading.
        # If the script still fails to find the elements (which should not
        # happen, but does if your internet connection fails),
        # it is possible to catch TimeoutException and loop the script, so it
        # will repeat.
        check_field = WebDriverWait(driver, 30).until(
                EC.presence_of_element_located((By.ID, 'ap_email')))
        email_field = driver.find_element_by_id('ap_email')
        email_field.clear()
        # type email
        email_field.send_keys(Credentials.email)
        check_field = WebDriverWait(driver, 30).until(
                EC.presence_of_element_located((By.ID, 'ap_password')))
        # find password field
        password_field = driver.find_element_by_id('ap_password')
        password_field.clear()
        # type password
        password_field.send_keys(Credentials.password)
        # find submit button, submit
        check_field = WebDriverWait(driver, 30).until(
                EC.presence_of_element_located((By.ID, 'signInSubmit')))
        submit = driver.find_element_by_id('signInSubmit')
        submit.click()
    # get history page
    driver.get('https://www.amazon.com/hz/mycd/myx#/home/alexaPrivacy/'
               'activityHistory&all')
    sleep(4)
    # amazon can give second auth page, so repeat the same as above
    if 'signin' in driver.current_url:
        logger.info("Got confirmation login page: logging in...")
        try:
            check_field = WebDriverWait(driver, 30).until(
                    EC.presence_of_element_located((By.ID, 'ap_email')))
            email_field = driver.find_element_by_id('ap_email')
            email_field.clear()
            email_field.send_keys(Credentials.email)
            check_field = WebDriverWait(driver, 30).until(
                    EC.presence_of_element_located((By.ID, 'continue')))
            submit = driver.find_element_by_id('continue')
            submit.click()
            sleep(1)
        except Exception:
            # the email step may be absent: Amazon may ask only for the password
            pass
        check_field = WebDriverWait(driver, 30).until(
                EC.presence_of_element_located((By.ID, 'ap_password')))
        password_field = driver.find_element_by_id('ap_password')
        password_field.clear()
        password_field.send_keys(Credentials.password)
        check_field = WebDriverWait(driver, 30).until(
                EC.presence_of_element_located((By.ID, 'signInSubmit')))
        submit = driver.find_element_by_id('signInSubmit')
        submit.click()
        sleep(3)
        logger.info("GET https://www.amazon.com/hz/mycd/myx#/home/alexaPrivacy/"
                   "activityHistory&all")
        # get history page again
        driver.get('https://www.amazon.com/hz/mycd/myx#/home/alexaPrivacy/'
                   'activityHistory&all')
    # find selector which allows to select Date Range 
    check = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located(
                (By.CLASS_NAME, "a-dropdown-prompt")))
    history = driver.find_elements_by_class_name('a-dropdown-prompt')
    history[0].click()
    check = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located(
                (By.CLASS_NAME, "a-dropdown-link")))
    # click 'Custom' (and fill in the requested range) or 'All History'
    all_hist = driver.find_elements_by_class_name('a-dropdown-link')
    for link in all_hist:
        if date_from and date_to:
            if 'Custom' in link.text:
                link.click()
                from_d = driver.find_element_by_id('startDateId')
                from_d.clear()
                # use the dates passed to the function, not hardcoded values
                from_d.send_keys(date_from)
                sleep(1)
                to_d = driver.find_element_by_id('endDateId')
                to_d.clear()
                to_d.send_keys(date_to)
                subm = driver.find_element_by_id('submit')
                subm.click()
        elif 'All' in link.text:
            link.click()

Enabling Downloads

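The following function enables downloads by sending Chrome the Page.setDownloadBehavior command, which allows downloads and sets the directory they are saved to: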

def enable_downloads(driver, download_dir):
    driver.command_executor._commands["send_command"] = (
        "POST", '/session/$sessionId/chromium/send_command')
    params = {
        'cmd': 'Page.setDownloadBehavior',
        'params': {'behavior': 'allow', 'downloadPath': download_dir},
    }
    command_result = driver.execute("send_command", params)
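
The function above registers a custom send_command endpoint by hand. As a hedged alternative sketch, Selenium's Chrome driver also offers execute_cdp_cmd, which sends the same Page.setDownloadBehavior DevTools command directly. The helper name allow_downloads is an assumption made for illustration and is not part of the original script.

```python
def allow_downloads(driver, download_dir):
    """Ask Chrome to allow downloads and save them into download_dir.

    `driver` is assumed to be a Selenium Chrome WebDriver, which provides
    execute_cdp_cmd(cmd, cmd_args) for Chrome DevTools Protocol commands.
    """
    return driver.execute_cdp_cmd(
        "Page.setDownloadBehavior",
        {"behavior": "allow", "downloadPath": download_dir},
    )
```

With a real driver, this would be called once right after the driver is created, in the same place where enable_downloads is called.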

Initializing the Driver

The following function initializes the Chrome driver.

def init_driver():
    logger.info("Starting chromedriver")
    chrome_options = Options()
    # use local data directory
    # headless mode can't be enabled since then amazon shows captcha
    chrome_options.add_argument("user-data-dir=selenium") 
    chrome_options.add_argument("start-maximized")
    chrome_options.add_argument("--disable-infobars")
    chrome_options.add_argument('--disable-gpu')  
    chrome_options.add_argument('--remote-debugging-port=4444')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument("--mute-audio")
    path = os.path.dirname(os.path.realpath(__file__))
    if not os.path.isdir(os.path.join(path, 'audios')):
        os.mkdir(os.path.join(path, 'audios'))
    chrome_options.add_experimental_option("prefs", {
        "download.default_directory": os.path.join(path, 'audios'),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    })
    try:
        driver = webdriver.Chrome(
            executable_path=ChromeDriverManager().install(), 
            options=chrome_options, service_log_path='NUL')
    except ValueError:
        logger.critical("Error opening Chrome. Chrome is not installed?")
        exit(1)
    driver.implicitly_wait(10)
    # set downloads directory to audios folder
    enable_downloads(driver, os.path.join(path, 'audios'))
    return driver

Downloading the Contents of a Page

For each page, we fetch all recordings and download them. Since there is no direct way to download these recordings, only to play them, this is where we hack a bit...

We basically extract an ID attribute which then becomes a part of the download link.

The ID attribute looks approximately like this (can vary):

audio-Vox:1.0/2019/10/27/21/1d2110cb8eb54f3cb6

In this example, 2019/10/27/21 is the date stamp, and the entire ID is appended to the link used for downloading this specific audio recording.
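
As a minimal sketch of this step, the snippet below turns such an ID into a date stamp and a download link. The helper name id_to_download is hypothetical; the URL prefix and the regular expression mirror the ones used in parse_page.

```python
import re

# URL prefix used by the script when building download links
PLAY_URL = "https://www.amazon.com/hz/mycd/playOption?"


def id_to_download(attr):
    """Return (date_stamp, url) for an audio element ID.

    Returns None when the ID has no path part, which is the case the
    script handles by playing the file instead.
    """
    if "/" not in attr:
        return None
    match = re.search(r"/(\d+/\d+/\d+/\d+)/", attr)
    # slashes are replaced so the date stamp can be used in a file name
    date_stamp = match.group(1).replace("/", "-") if match else None
    url = PLAY_URL + attr.replace("audio-", "id=", 1)
    return date_stamp, url


stamp, url = id_to_download("audio-Vox:1.0/2019/10/27/21/1d2110cb8eb54f3cb6")
# stamp is '2019-10-27-21'
# url ends with 'playOption?id=Vox:1.0/2019/10/27/21/1d2110cb8eb54f3cb6'
```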

We also use additional info stored in the element with class summaryCss.

If there is no additional info, the recording is named 'Audio could not be understood'.

def parse_page(driver):
    links = []
    # links will contain all links harvested from one page
    check = WebDriverWait(driver, 30).until(EC.presence_of_element_located(
                                                   (By.CLASS_NAME, "mainBox")))
    boxes = driver.find_elements_by_class_name('mainBox')
    # mainBox corresponds to each element with audio recording
    for box in boxes:
        # if there is no voice, element can be detected by its class and skipped
        non_voice = box.find_elements_by_class_name('nonVoiceUtteranceMessage')
        if non_voice:
            logger.info('Non-voice file. Skipped.')
            continue
        non_text = box.find_elements_by_class_name('textInfo')
        if non_text:
            if 'No text stored' in non_text[0].text:
                logger.info("Non-voice file. Skipped.")
                continue
        # else we can find audio element and extract its data
        check = WebDriverWait(driver, 30).until(EC.presence_of_element_located(
                                                       (By.TAG_NAME, "audio")))
        audio_el = box.find_elements_by_tag_name('audio')
        for audio in audio_el:
            try:
                attr = audio.get_attribute('id')
                # we extract ID attribute which then becomes a part of the link.
                # ID approximately looks like this (can vary):
                # audio-Vox:1.0/2019/10/27/21/1d2110cb8eb54f3cb6
                # here 2019/10/27/21 is the date, and the whole ID is being
                # added to the link to download said audio recording.

                # Additional info is stored in element with class summaryCss.
                # If there is no additional info then the element will be named
                # as 'audio could not be understood'.
                get_name = box.find_elements_by_class_name('summaryCss')
                if not get_name:
                    get_name = 'Audio could not be understood'
                else:
                    get_name = get_name[0].text
                # subInfo element contains date and device data which we extract
                check = WebDriverWait(driver, 30).until(
                    EC.presence_of_element_located((By.CLASS_NAME, "subInfo")))
                subinfo = box.find_elements_by_class_name('subInfo')
                time = subinfo[0].text
                # extracting date from ID attribute, since it is easier.
                get_date = re.findall(r'\/(\d+\/\d+\/\d+\/\d+)\/', attr)
                try:
                    # replace slashes to -.
                    get_date = get_date[0].strip().replace('/', '-')
                except IndexError:
                    try:
                        # in case there is no date in the attribute
                        # (which should not happen anymore)
                        # we extract date from subInfo element and turn it
                        # into normal, easy for sorting date, e.g 2019/10/11.
                        get_date = re.findall(
                            r'On\s(.*?)\s(\d{1,2})\,\s(\d{4})', time)
                        month = get_date[0][0]
                        new = month[0].upper() + month[1:3].lower()
                        month = strptime(new,'%b').tm_mon
                        get_date = f"{get_date[0][2]}-{month}-{get_date[0][1]}"
                    except IndexError:
                        get_date = re.findall(r'(.*?)\sat', time)
                        day = get_date[0]
                        if 'Yesterday' in day:
                            day = datetime.now() - timedelta(days=1)
                            day = str(day.day)
                        elif 'Today' in day:
                            day = str(datetime.now().day)
                        day = day if len(day) == 2 else '0'+day
                        curr_month = str(datetime.now().month)
                        curr_month = curr_month if len(
                                            curr_month) == 2 else '0'+curr_month
                        curr_year = datetime.now().year
                        get_date = f"{curr_year}-{curr_month}-{day}"
                # Extract exact time and device
                find_p0 = time.find('at')
                find_p1 = time.find('on')
                get_time = time[find_p0+2:find_p1-1].replace(':', '-')
                device = time[find_p1:]
                # Form element name
                name = f"{get_date} {get_time} {get_name} {device}"
                # Strip all dangerous symbols from the name.
                # Dangerous symbols are symbols which Windows can not accept
                name = re.sub(r'[^\w\-\(\) ]+', '', name)
                # Allow maximum 3 duplicates
                # if there is such element already, 
                # (1)+n will be added to its name.
                for link in links:
                    if name == link[1]:
                        name += ' (1)'
                        break
                dup = 1
                while dup <= 3:
                    for link in links:
                        if name == link[1]:
                            name = name.replace(f"({dup})", f"({dup+1})")
                    dup += 1
                print("_"*80)
                logger.info(f"Found: {attr}\n{name}")
                # check if recording already exists on the disk
                if not os.path.isfile(os.path.join('audios', name+'.wav')):
                    if '/' not in attr:
                        # if ID is incorrect at all, we play the file
                        # and try to extract the link generated by amazon itself
                        logger.info(
                            "ID attribute was not found. Playing the file.")
                        play_icon = box.find_elements_by_class_name(
                                                                   'playButton')
                        get_onclick = play_icon[0].get_attribute('onclick')
                        driver.execute_script(get_onclick)
                        sleep(8)
                        get_source = box.find_elements_by_tag_name('source')
                        src = get_source[0].get_attribute('src')
                        # if we had success, link is appended to links
                        if 'https' in src:
                            links.append([src, name])
                        else:
                            logger.critical(
                                   "Link was not found after playing the file. "
                                   "Item skipped.")
                    else:
                        # If audio ID is valid, we replace audio with id
                        # and append it to the link.
                        # From now we can download it.
                        if attr.replace('audio-', ''):
                            attr = attr.replace('audio-', 'id=')
                            links.append([
                            'https://www.amazon.com/hz/mycd/playOption?'+attr,
                            name])
                else:
                    logger.info(f"File exists; passing: {name}.wav")
            except Exception:
                logger.critical(traceback.format_exc())
                logger.critical("Item failed; passing")
                continue
    return links
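
The file-naming part of parse_page can be illustrated in isolation. The sketch below is a simplified variant and not the script's exact logic: safe_name applies the same regular expression, while unique_name (a hypothetical helper) appends the first free " (n)" suffix and, unlike the original, does not cap the number of duplicates at three.

```python
import re


def safe_name(raw):
    """Keep only word characters, '-', parentheses and spaces."""
    return re.sub(r"[^\w\-\(\) ]+", "", raw)


def unique_name(name, taken):
    """Return name, or name plus ' (n)' for the smallest unused n >= 1."""
    if name not in taken:
        return name
    n = 1
    while f"{name} ({n})" in taken:
        n += 1
    return f"{name} ({n})"


cleaned = safe_name("2019-10-27-21 9-15 PM what's the time? on Kitchen")
# cleaned is '2019-10-27-21 9-15 PM whats the time on Kitchen'
```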

Our Main Function

Our main function logs in to the Amazon account using the Credentials class and goes to Alexa's history. It then selects either the entire history from day one or the date range given on the command line. Next, it emulates what a human would do in order to play each recording (which Amazon allows); however, it then locates the audio file used for this playback and downloads it.

def main():
    ap = ArgumentParser()
    ap.add_argument(
        "-f", "--date_from", required=False, 
        help=("Seek starting from date MM/DD/YYYY.")
    )
    ap.add_argument(
        "-t", "--date_to", required=False,
        help=("Seek until date MM/DD/YYYY.")
    )
    args = vars(ap.parse_args())
    if args["date_from"] and not args["date_to"]:
        args["date_to"] = str(datetime.now().month) +'/'+ str(datetime.now(
                                        ).day) +'/'+ str(datetime.now().year)
    if args["date_to"] and not args["date_from"]:
        logger.critical("You haven't specified beginning date. Use -f option.")
        exit(1)

    sys_sleep = WindowsInhibitor()
    sys_sleep.inhibit()
    logger.info("System inhibited.")
    
    # start chromedriver
    driver = init_driver()

    while True:
        try:
            # login
            amazon_login(driver, args["date_from"], args["date_to"])
            break
        except TimeoutException:
            # catch broken connection
            logger.critical("Timeout exception. No internet connection? "
                            "Retrying...")
            sleep(10)
            continue

    # after few attempts will reset the page
    failed_page_attempt = 0
    while True:
        logger.info("Parsing links...")
        driver.implicitly_wait(2)

        try:
            # parse current page for audios
            links = parse_page(driver)
            # reset fail counter on each success
            failed_page_attempt = 0
        except TimeoutException:
            # catch broken connection
            logger.critical(traceback.format_exc())
            if failed_page_attempt < 3:
                # count the attempt first so the log reads 1/3, 2/3, 3/3
                failed_page_attempt += 1
                logger.critical("No Internet connection? Retrying...")
                logger.critical(f"Attempt #{failed_page_attempt}/3")
                sleep(5)
                continue
            else:
                failed_page_attempt = 0
                logger.critical("Trying to re-render page...")
                driver.execute_script('getPreviousPageItems()')
                sleep(5)
                driver.execute_script('getNextPageItems()')
                continue

        logger.info(f"Total files to download: {len(links)}")

        for item in links:
            # download parsed items
            fetch(driver, item)

        # find the 'Next' button, which moves to the next page.
        failed_button_attempt = 0
        while True:
            try:
                check_btn = WebDriverWait(driver, 30).until(
                        EC.presence_of_element_located((By.ID, 'nextButton')))
                failed_button_attempt = 0
                break
            except TimeoutException:
                if failed_button_attempt < 3:
                    # count the attempt first so the log reads 1/3, 2/3, 3/3
                    failed_button_attempt += 1
                    logger.critical(
                            "Timeout exception: next button was not found. "
                            "No Internet connection? Waiting and retrying...")
                    logger.critical(f"Attempt #{failed_button_attempt}/3")
                    sleep(10)
                    continue
                else:
                    failed_button_attempt = 0
                    logger.critical("Trying to re-render page...")
                    driver.execute_script('getPreviousPageItems()')
                    sleep(5)
                    driver.execute_script('getNextPageItems()')
                    continue
        nextbtn = driver.find_element_by_id('nextButton').get_attribute('class')
        if 'navigationAvailable' in nextbtn:
            # if button is active, click it.
            driver.implicitly_wait(10)
            while True:
                try:
                    logger.info("Next page...")
                    driver.find_element_by_id('nextButton').click()
                    break
                except:
                    logger.critical("Unable to click the next button. "
                                    "Waiting and retrying...")
                    sleep(10)
                    continue
            continue
        else:
            # if button is inactive, this means it is the last page.
            # script is done here.
            break
    driver.close()
    driver.quit()
    if args['date_from']:
        logger.info('All done. Press Enter to exit.')
        i = input()
    else:
        logger.info("All done. Exit.")
    logger.info("System uninhibited.")
    sys_sleep.uninhibit()

Fetching a Selected Date Range

It is also possible to fetch recordings from a given date range.

To do so, use the -f and -t options, which specify the start date and the end date, e.g.:

python alexa.py -f 11/03/2019 -t 11/05/2019
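
The option handling can be sketched on its own as follows. The function name resolve_range and its today parameter are assumptions added for illustration and testing. Unlike the script, this sketch zero-pads the default end date so that it matches the MM/DD/YYYY format named in the help text.

```python
from argparse import ArgumentParser
from datetime import date


def resolve_range(argv, today=None):
    """Parse -f/-t and return (date_from, date_to) as MM/DD/YYYY strings."""
    ap = ArgumentParser()
    ap.add_argument("-f", "--date_from", help="Seek starting from date MM/DD/YYYY.")
    ap.add_argument("-t", "--date_to", help="Seek until date MM/DD/YYYY.")
    args = ap.parse_args(argv)
    if args.date_to and not args.date_from:
        # an end date alone is rejected, as in the script
        ap.error("You haven't specified a beginning date. Use the -f option.")
    if args.date_from and not args.date_to:
        # a start date alone means "until today"
        today = today or date.today()
        args.date_to = today.strftime("%m/%d/%Y")
    return args.date_from, args.date_to
```

Calling resolve_range(["-f", "11/03/2019"]) would return the start date together with today's date, while an empty argument list returns (None, None), which corresponds to fetching the entire history.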

Points of Interest

From time to time, Amazon might block the activity after identifying a mass download. In such a case, our script just waits and then resumes. Here is the code that does that: we check whether the destination file is valid, and if it is not (its size is 0 bytes), we retry.

                if os.path.isfile(os.path.join('audios', name+'.wav')):
                    if os.stat(os.path.join('audios', name+'.wav')).st_size == 0:
                        logger.info("File size is 0. Retrying.")
                        sleep(3)
                        continue
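
The excerpt above is part of a larger loop that is not shown here. As a self-contained sketch of the same idea, the function below retries a download until the file is non-empty. The download callable is hypothetical and stands in for whatever actually writes the file, and the limit on attempts is an assumption added so that the sketch always terminates.

```python
import os
import time


def download_with_retry(download, path, attempts=5, delay=3.0):
    """Call download(path) until the file exists and is not empty.

    `download` is a hypothetical callable that writes the recording to
    `path`. Returns True on success, or False if every attempt left a
    missing or 0-byte file.
    """
    for _ in range(attempts):
        download(path)
        if os.path.isfile(path) and os.stat(path).st_size > 0:
            return True
        # a 0-byte file suggests the request was blocked: wait, then retry
        time.sleep(delay)
    return False
```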

After all, if Amazon stores our personal recordings, why can't we?

History

7th November, 2019: Initial version