A not so CeWL way to build a Wordlist

Introduction

Whilst there are many useful tools for building tailored wordlists (such as CeWL - https://github.com/digininja/CeWL and CUPP - https://github.com/Mebus/cupp), saving words while browsing a website is often overlooked, yet it can help create the ideal wordlist for file/directory discovery and further enumeration.
This post covers web2words, a new Firefox extension we’ve created that saves all words from websites as you browse.

Warning: This extension is currently in testing, so be sure to read and understand the code prior to use. For now it has to be loaded as a temporary add-on through Firefox debugging.
Because it saves the text of every page you visit, use it on your testing instance of Firefox only (e.g. Firefox Developer Edition - https://www.mozilla.org/en-US/firefox/developer/) and not your main browsing instance.

Firefox Extension Creation

Mozilla provides excellent, detailed information on creating extensions at https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/Your_first_WebExtension
This post will skip over all of the basics and simply show the extension file contents, how to load the extension into Firefox and how to use it in your testing activities.
Prior to commencing, create a directory to save all of the files, e.g. web2words

manifest.json

Save the following to web2words/manifest.json

{
  "manifest_version": 3,
  "name": "Webpage Text Saver",
  "version": "1.0",
  "description": "Saves all text from a webpage and updates on changes.",
  "permissions": ["activeTab", "tabs", "storage", "downloads"],
  "background": {
    "scripts": ["backgroundScript.js"]
  },
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["contentScript.js"]
    }
  ]
}

Further reading on manifest.json https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/manifest.json

contentScript.js

Save the following to web2words/contentScript.js

// Listen for the page to fully load
window.addEventListener("load", function () {
  // Once the page is loaded, extract all text
  const pageText = document.body.innerText; // Get all visible text from the page
  browser.runtime.sendMessage({ action: "saveText", text: pageText });

  // Initialize the MutationObserver
  const observer = new MutationObserver(function (mutations) {
    console.log("DOM has changed");
    const updatedPageText = document.body.innerText; // Get updated text from the page
    browser.runtime.sendMessage({ action: "updateText", text: updatedPageText });
  });
  observer.observe(document, {
    childList: true,
    subtree: true
  });
});

// Listen for URL changes (hash changes)
window.addEventListener("hashchange", function () {
  const pageText = document.body.innerText; // Get all visible text from the page
  browser.runtime.sendMessage({ action: "saveText", text: pageText });
});

// Listen for refresh messages from background script
browser.runtime.onMessage.addListener(function (request, sender, sendResponse) {
  if (request.action === "refreshText") {
    const pageText = document.body.innerText; // Get all visible text from the page
    browser.runtime.sendMessage({ action: "saveText", text: pageText });
  }
});

Further reading on content scripts - https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/Content_scripts
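
The MutationObserver callback in the content script fires on every DOM change, so it can send many messages that carry identical text (and each message triggers a download). One possible way to cut this down is to remember the last text sent and skip repeats. The following is only a minimal sketch: createChangeDetector is a hypothetical helper name and is not part of the extension as published.

```javascript
// Hypothetical helper (not part of the published extension): remembers the
// last text it was given and reports whether a new value differs from it.
function createChangeDetector() {
  let lastText = null;
  return function hasChanged(text) {
    if (text === lastText) {
      return false; // Same text as last time, nothing new to send
    }
    lastText = text;
    return true;
  };
}

// Usage sketch inside the content script (assumes the `browser` API exists):
// const hasChanged = createChangeDetector();
// const updatedPageText = document.body.innerText;
// if (hasChanged(updatedPageText)) {
//   browser.runtime.sendMessage({ action: "updateText", text: updatedPageText });
// }
```

This only filters exact repeats; pages whose visible text changes constantly (e.g. clocks or tickers) would still produce a message per change.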

backgroundScript.js

Save the following to web2words/backgroundScript.js

let storedText = "";

// Listen for messages from content scripts
browser.runtime.onMessage.addListener(async function(request, sender, sendResponse) {
  if (request.action === "saveText") {
    storedText = request.text; // Initialize stored text
    // Save to storage
    await browser.storage.local.set({ text: storedText });
    // Save stored text to a file
    await saveToFile();
  } else if (request.action === "updateText") {
    storedText += "\n--- Updated Text ---\n" + request.text; // Append updated text
    // Save to storage
    await browser.storage.local.set({ text: storedText });
    // Save updated text to a file
    await saveToFile();
  }
});

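// When a tab finishes loading, ask its content script to resend the page text
browser.tabs.onUpdated.addListener(function (tabId, changeInfo, tab) {
  if (changeInfo.status === "complete") {
    // sendMessage rejects when the tab has no content script listening
    // (e.g. about: pages), so ignore that error
    browser.tabs.sendMessage(tabId, { action: "refreshText" }).catch(function () {});
  }
});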

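// Function to save stored text to a file
async function saveToFile() {
  const text = await browser.storage.local.get("text");
  if (text.text) {
    const blob = new Blob([text.text], { type: "text/plain" });
    const url = URL.createObjectURL(blob);
    const downloadId = await browser.downloads.download({
      url: url,
      filename: "webpageText.txt",
      saveAs: false
    });
    // download() resolves when the download starts, not when it finishes,
    // so only revoke the object URL once the download is no longer in progress
    browser.downloads.onChanged.addListener(function onChanged(delta) {
      if (delta.id === downloadId && delta.state && delta.state.current !== "in_progress") {
        browser.downloads.onChanged.removeListener(onChanged);
        URL.revokeObjectURL(url);
      }
    });
  }
}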

Further reading on background scripts - https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/manifest.json/background
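
Note that the updateText branch appends the entire page text on every DOM change, so the stored text (and each downloaded file) grows quickly with repeated content. A minimal sketch of one alternative, assuming a hypothetical helper named mergeText that is not part of the extension as published, is to append only lines that have not been stored yet:

```javascript
// Hypothetical helper: appends only the lines of `incoming` that are not
// already present in `stored`, so repeated updates do not duplicate text.
function mergeText(stored, incoming) {
  const merged = stored === "" ? [] : stored.split("\n");
  const seen = new Set(merged);
  for (const line of incoming.split("\n")) {
    // Skip blank lines and lines that have already been stored
    if (line !== "" && !seen.has(line)) {
      seen.add(line);
      merged.push(line);
    }
  }
  return merged.join("\n");
}

// Usage sketch in the "updateText" branch:
// storedText = mergeText(storedText, request.text);
```

This variant drops the "--- Updated Text ---" separator, which should not matter for wordlist generation since only the words themselves are used.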

Bash helper scripts

Two helper scripts are also used: gen-words.sh generates a clean wordlist and clean.sh deletes the downloaded files. These will likely be phased out when the extension matures.
Adapt them as required for your use.

gen-words.sh

This script will generate a words.txt file with a clean wordlist.
Save the following to web2words/gen-words.sh

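#!/bin/bash
# Split the downloaded text on spaces and tabs, one token per line
cat ~/Downloads/webpageText* | tr " \t" "\n\n" | sort -u >> words.txt
# Also split each token on non-alphanumeric characters (GNU sed syntax), writing
# to a temporary file instead of appending to the file that is being read
sed "s/[^[:alnum:]]/\n/g" words.txt | sort -u > words.tmp
# Merge both sets and de-duplicate
sort -u -o words.txt words.txt words.tmp && rm words.tmp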
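
If the word splitting is later moved into the extension itself, the logic of gen-words.sh could look roughly like the sketch below. This is an illustration only: extractWords is a hypothetical function name, it splits on any whitespace rather than just spaces and tabs, and its sort order may differ from that of sort -u.

```javascript
// Hypothetical helper roughly mirroring gen-words.sh: returns a sorted,
// de-duplicated list containing each whitespace-separated token plus the
// alphanumeric pieces of that token.
function extractWords(text) {
  const words = new Set();
  for (const token of text.split(/\s+/)) {
    if (token === "") continue; // Ignore leading/trailing whitespace
    words.add(token);
    // \p{L} and \p{N} match Unicode letters and digits
    for (const piece of token.split(/[^\p{L}\p{N}]+/u)) {
      if (piece !== "") words.add(piece);
    }
  }
  // Default sort compares UTF-16 code units (uppercase before lowercase)
  return [...words].sort();
}

// Usage sketch:
// extractWords("Nmap Security Scanner\nnmap.org").join("\n");
```

Keeping the original tokens alongside their split pieces preserves entries such as nmap.org, which can be useful for file discovery, as well as nmap and org.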

clean.sh

This script cleans up the downloaded webpageText files and deletes your words.txt wordlist.
Save the following to web2words/clean.sh

#!/bin/bash
# -f avoids an error when there is nothing to delete
rm -f words.txt
rm -f ~/Downloads/webpageText*

Loading the extension

In Firefox, navigate to about:debugging#/runtime/this-firefox then click “Load Temporary Add-on” as shown below:

Click on any file in the web2words directory relating to the extension (e.g. manifest.json) and confirm the extension has been loaded as follows:

Testing it out

Navigate to any website (e.g. https://nmap.org) and confirm the Firefox Download progress indicator flashes:

Clicking on the Download button will reveal the filename for each download (i.e. webpageText.txt and so on for each subsequent download):

Generating and viewing the wordlist

To generate the wordlist, run ./gen-words.sh from the web2words directory (make the script executable first with chmod +x gen-words.sh if required).
View the contents as follows:

cat words.txt
...
administrators
...
API
...
capture
...
Ncat
...
Ndiff
...
nmap
...

At this point we’ve confirmed everything is functional. Continue browsing your target website and run the ./gen-words.sh script from time to time to update the wordlist.

Conclusion

This concludes the basic setup and use of our web2words Firefox Extension.
Feel free to contact us if you have any suggestions, questions or would like to schedule a meeting to discuss anything further.