gorltom
Golang URL-to-Markdown API
gorltom is a simple-to-use API that takes a full URL as a string at this endpoint:
https://gorltom.corbia.net/api/url
It then opens the page with chromedp (just in case we need to wait for some JS-generated content...) and takes this HTML atrocity:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta property="og:url" content="https://literally_the_current_url_thank_you.com">
<meta property="og:something-else" content="but not used properly">
<meta property="og:something-useful" content="if only dev followed some standards">
<link type="text/css" rel="stylesheet" href="https://cdn.why.com/too-much.min.css">
<link type="text/css" rel="stylesheet" href="https://cdn.bootstrap.com/flexbox-are-hard.min.css">
<title>Title of the example webpage</title>
</head>
<body>
<div class="basically-the-body-tag">
<noscript>This website works better with JavaScript.</noscript>
<div class="bloat that is only usefull for browsers">
<div class="some-ugly-class">
<nav id="top-menu">
<ul class="nostyle ul bs">
<li class="random bs 342345234fffDDD">
<span class="menu item obviously">
<a href="/about" target="_blank">ABOUT</a>
</span>
</li>
<li class="random bs 342345234fffDDD">
<span class="menu item obviously">
<a href="/blog" target="_blank">BLOG</a>
</span>
</li>
</ul>
</nav>
</div>
<aside>
...
</aside>
<div>
<section class="main">
<article>
<header class="article-header-top-max">
<h3>Title of the article</h3>
</header>
<p>Text of the first paragraph of the article.</p><br>
<p>Text of the second paragraph of the article.</p><br>
<p>Text of the third paragraph of the article but this time it contains a <a href="https://link-to-another-website.com/example">link</a> inside of the text.</p><br>
</article>
</section>
</div>
</div>
</div>
<script src="https://cdn.spyware.com/lib.min.js"></script>
<script src="https://cdn.spyware.com/other-lib.min.js"></script>
<script src="https://cdn.google.com/something-probably-evil.min.js"></script>
</body>
</html>
And return this beautiful markdown as a string:
# Title of the example webpage
###### (*gorltom extract of https://notexample.com/*)
###### *assumed_menu*
- [ABOUT](https://notexample.com/about)
- [BLOG](https://notexample.com/blog)
###### *article*
### Title of the article
Text of the first paragraph of the article.
Text of the second paragraph of the article.
Text of the third paragraph of the article but this time it contains a [link](https://link-to-another-website.com/example) inside of the text.
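In the extract above, the relative menu links (/about, /blog) come back as absolute URLs. The README does not show how gorltom does this, so the following is only a minimal sketch of one way to get that result with the standard net/url package. The function name absoluteLink is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteLink is a hypothetical helper: it resolves href against the URL of
// the page being converted and formats the result as a Markdown link.
func absoluteLink(pageURL, text, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference leaves absolute hrefs untouched and resolves
	// relative ones against the base URL.
	return fmt.Sprintf("[%s](%s)", text, base.ResolveReference(ref)), nil
}

func main() {
	for _, href := range []string{"/about", "https://link-to-another-website.com/example"} {
		link, err := absoluteLink("https://notexample.com/", "link", href)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(link)
	}
}
```

Links that are already absolute pass through unchanged, which matches the third paragraph of the extract.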
The API expects the following JSON:
{
"url": "https://full-url-of.com/the/page"
}
And will return the following:
{
"md" : "# Home of full-url-of\n###### (*gorltom extract of https://full-url-of.com/the/page*)\n\n## Some header\n\n#### A tagline maybe\n\n###### *assumed_menu*\n- [HTML for newbies](https://full-url-of.com/html)\n- [CSS for artists](https://full-url-of.com/css)"
}
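As a rough illustration of the request and response shapes above, here is a small Go client sketch. It assumes the endpoint accepts a POST with a JSON body, which the README does not state explicitly. The names fetchMarkdown and newStubServer are hypothetical. So that the sketch runs offline, main talks to a local stub that only imitates the response shape. The stub is not the real service.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// request and response mirror the JSON bodies shown above.
type request struct {
	URL string `json:"url"`
}

type response struct {
	MD string `json:"md"`
}

// fetchMarkdown sends the page URL to the endpoint and returns the "md" field.
// Assumption: the API is called with POST and a JSON body.
func fetchMarkdown(client *http.Client, endpoint, pageURL string) (string, error) {
	body, err := json.Marshal(request{URL: pageURL})
	if err != nil {
		return "", err
	}
	resp, err := client.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var out response
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.MD, nil
}

// newStubServer starts a local stand-in that echoes the requested URL back
// in the extract line, purely for demonstration.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var in request
		if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(response{MD: "###### (*gorltom extract of " + in.URL + "*)"})
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	md, err := fetchMarkdown(srv.Client(), srv.URL, "https://full-url-of.com/the/page")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(md)
}
```

To try it against the hosted service, pass http.DefaultClient and https://gorltom.corbia.net/api/url instead of the stub's client and URL.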
gorltom is opinionated.
Every <nav> is treated as an "assumed_menu". If the HTML contains <main> or <article> tags, they are indicated in the Markdown version.
Every table is turned into CSV. For example, the table below is followed by its CSV output:
<table>
<thead>
<td>First Name</td>
<td>Age</td>
</thead>
<tbody>
<tr>
<td>Alice</td>
<td>32</td>
</tr>
<tr>
<td>Bob</td>
<td>34</td>
</tr>
</tbody>
</table>
First Name,Age
Alice,32
Bob,34
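The table rule can be sketched with the Go standard library alone. This is not gorltom's actual code, which the README does not show. encoding/xml stands in here for a real HTML parser, which only works because the sample table is well formed. The names tableToCSV and sampleTable are hypothetical.

```go
package main

import (
	"encoding/csv"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// sampleTable is the table from the example above.
const sampleTable = `<table>
<thead><td>First Name</td><td>Age</td></thead>
<tbody>
<tr><td>Alice</td><td>32</td></tr>
<tr><td>Bob</td><td>34</td></tr>
</tbody>
</table>`

// tableToCSV reads a single <table> fragment token by token and writes one
// CSV record per row. A row ends at </tr>, or at </thead> when the header
// cells are not wrapped in a <tr>, as in the sample.
func tableToCSV(fragment string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(fragment))
	dec.Strict = false // tolerate HTML-style input
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var (
		out    strings.Builder
		cell   strings.Builder
		row    []string
		inCell bool
	)
	w := csv.NewWriter(&out)

	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "td" || t.Name.Local == "th" {
				inCell = true
				cell.Reset()
			}
		case xml.CharData:
			if inCell {
				cell.Write(t)
			}
		case xml.EndElement:
			switch t.Name.Local {
			case "td", "th":
				row = append(row, strings.TrimSpace(cell.String()))
				inCell = false
			case "tr", "thead":
				if len(row) > 0 {
					if err := w.Write(row); err != nil {
						return "", err
					}
					row = nil
				}
			}
		}
	}
	w.Flush()
	return out.String(), w.Error()
}

func main() {
	out, err := tableToCSV(sampleTable)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Using encoding/csv for the output means cells that contain commas or quotes are escaped automatically.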
The HTML is parsed from top to bottom, node after node.