[Question] Sed and regex string manipulation

[email protected]

I've got a sed regex that should work, just writing up a breakdown of the whole command so anyone interested can follow what it does. Will post in a bit.

[email protected]

Hello Thanks for your reply !

That's exactly what I did and how I came to my "final" result, but it doesn't work as expected... because of my lack of knowledge and understanding !

Will give sd a try and see if I can come up with something ! Thanks for the pointer !

[email protected]

This would be awesome ! A breakdown of the whole command will give me a better understanding !

Thank you in advance, waiting for your post

[email protected]

Hello,

I have thought of a Python script and looked around a bit but couldn't find anything satisfactory. Also, I'm a tiny bit more versed in bash/CLI than in Python... Even though that's very arguable !

I looked through the GitHub repo and at first glance I have no idea how this could do the job. Again, I probably have to dig a bit deeper and understand what this is actually doing !

Thanks for the pointer, will give it a try

[email protected]

Is the '%MARKDOWN' part of your example correct? That should also be converted to a dash? Or did you forget the 20 there?

[email protected]

Oupsi ! Forgot the 20 there !

[email protected]

Okay, here's the command and a breakdown. I broke down every part of the command, not because I think you are dumb, but because reading these can be complicated and confusing. Additionally, detailed breakdowns like these have helped me in the past.

The command:

sed -ri 's|]\(#.+\)|\L&|; s|%20|-|g' /path/to/somefile

The breakdown:

sed - calls sed

-r - allows for the use of extended regular expressions

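-i - edit the file given as an argument at the end of the command, in place (note, the i flag must follow the r flag: in GNU sed, -i takes an optional backup suffix, so -ir would be read as -i with the suffix r, and the extended regular expressions would not be enabled)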
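To see why the flag order matters, here is a small sketch, assuming GNU sed. The temporary directory and the file names a.md and b.md are made up for the demonstration and are not part of the command above.

```shell
# Assumes GNU sed. The temp dir and the file names are hypothetical.
tmpdir=$(mktemp -d)
printf 'aaa\n' > "$tmpdir/a.md"
printf 'aaa\n' > "$tmpdir/b.md"

# -ri: extended regex is on, so a+ means "one or more a".
sed -ri 's|a+|X|' "$tmpdir/a.md"
right=$(cat "$tmpdir/a.md")

# -ir: read as -i with backup suffix "r". Extended regex stays off,
# so a+ looks for a literal "a+" and nothing is replaced.
sed -ir 's|a+|X|' "$tmpdir/b.md"
wrong=$(cat "$tmpdir/b.md")

# The "r" suffix also leaves a backup copy named b.mdr behind.
if [ -e "$tmpdir/b.mdr" ]; then backup_made=yes; else backup_made=no; fi

printf 'with -ri: %s\nwith -ir: %s\nbackup b.mdr created: %s\n' \
  "$right" "$wrong" "$backup_made"
rm -rf "$tmpdir"
```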

Now the regex piece by piece. This command has two substitution expressions, to break the goals down into manageable chunks.

Expression one is to convert the markdown links to lowercase. That expression is:

's|]\(#.+\)|\L&|;

The goal of this expression is to find markdown links, and to ignore https links. In your post you indicate the markdown links all start with a # symbol, so we don't have to explicitly ignore the https as much as we just have to match all links starting with #. Here's the breakdown:

' - begins the entire expression set. If you had to match the ' character in your expression you would begin the expression set with " instead of '.

s| - invoking find and replace (substitution). Note, I'm using the | as a separator instead of the / for easier readability. In sed, you can use just about any character as the separator.

]\(# - This is how we find the link we want to work on. In markdown, every link is preceded by ]( to indicate a closing of the link text and the opening of the actual url. In the expression, the ( is preceded by a \ because it is a special regex character. So \( tells sed to find an actual opening parenthesis character. Finally the # will be the first character of the markdown links we want to convert to lowercase, as indicated by your example. The inclusion of the # ensures no https links will be caught up in the processing.

.+ - this bit has two parts, . and +. These are two special regex characters. The . tells sed to find any character at all and the + tells it to find the preceding item one or more times. In the case of .+, it's telling sed to find any character, one or more times in a row. You might think this will eat ALL of the text in the document and make it all lowercase, but it will not: sed works on one line at a time, and the next part of the regex tells it where the match has to end.

\) - this tells sed to find a closing parenthesis. Like the opening parenthesis, it is a special regex character and needs to be escaped with the backslash to tell sed to find an actual closing parenthesis character. This is what stops the command from converting everything to lowercase, because when you combine the previous bit with this bit like so .+\), you're telling sed to find any characters up to a closing parenthesis. Note that .+ is greedy, so if a line contains more than one closing parenthesis, the match runs to the last one on that line.

| - This tells sed we're done looking for text to match. The next bits are about how to modify/replace that text

\L - This tells sed to convert the given text to all lowercase

& - Stands for the entire matched text, so \L& tells sed to convert the whole match to lowercase.

; - this tells sed that this is the end of the first expression, and that more are coming.

So all together, what this first expression does is: Find a closing bracket followed by an opening parenthesis followed by a pound/hash symbol followed by one or more characters up to a closing parenthesis. Then convert that entire chunk of text to lowercase. Because symbols don't have case you can just convert the entire matched pattern to lowercase. If there were specific parts that had to be kept case sensitive, then you'd have to match and modify more precisely.
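One thing worth checking against your own files: .+ is greedy, so on a line with several links the match runs from the first ](# to the last closing parenthesis on that line. Below is a quick sketch, assuming GNU sed (\L is a GNU extension) and a made-up sample line with three links. The [^)]+ variant is only one possible way to stop at the first closing parenthesis; it is not part of the command above.

```shell
# Assumes GNU sed. Hypothetical sample line with three links.
line='Multiple [links](#Links.md) in (%20) [a](#An%20A.md) SINGLE [line](#Lines.md)'

# Greedy .+ : one match from the first "](#" to the LAST ")" on the line,
# so the plain text between the links is lowercased too.
greedy=$(printf '%s\n' "$line" | sed -r 's|]\(#.+\)|\L&|')

# [^)]+ cannot cross a ")", so each link is matched on its own;
# the g flag then handles every link on the line.
per_link=$(printf '%s\n' "$line" | sed -r 's|]\(#[^)]+\)|\L&|g')

printf '%s\n%s\n' "$greedy" "$per_link"
# Multiple [links](#links.md) in (%20) [a](#an%20a.md) single [line](#lines.md)
# Multiple [links](#links.md) in (%20) [a](#an%20a.md) SINGLE [line](#lines.md)
```

If your files only ever have one link per line, the greedy version behaves the same as the per-link one.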

The next expression is pretty easy, UNLESS any of your https links also include the string %20:

If no https links contain the %20 string, then this will do the trick:

s|%20|-|g'

s| - again opens the expression telling sed we're looking to substitute/modify text

%20 - tells sed to find exactly the character sequence %20

| - ends the pattern matching portion of the expression

- - tells sed to replace the matched pattern with the exact character -

| - tells sed that's the end of the modification instructions

g - tells sed to do this globally. Without it, only the first match on each line is replaced; with it, every occurrence of the string %20 in the document is replaced with the string -

' - tells sed that is the end of the expression(s) to be evaluated.

So all together, what this expression does is: Within the given document, find every occurrence of a percent symbol followed by the number two followed by the number zero and replace them with the dash character.

/path/to/somefile - tells sed what file to work on.

Part of using regex is understanding the contents of your own text, and with the information and examples given, this should work. However, if the markdown links have different formatting patterns, or as mentioned any of the https links have the %20 string in them, or other text in the document might falsely match, then you'd have to provide more information to get a more nuanced regex to match.
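A quick way to try the command without touching any file is to drop -i and pipe sample lines through it. This is a minimal sketch, assuming GNU sed; the two sample links and the helper name convert are made up for illustration. The second line shows the caveat about https links that contain %20.

```shell
# Assumes GNU sed. -i is left out so nothing on disk is modified.
same_file='[Some text](#Header%20Linking%20MARKDOWN.md)'
external='[NOT](https://example.com/URL%20Should%20Be%20Untouched.html)'

# Hypothetical helper wrapping the two expressions from the command.
convert() {
  sed -r 's|]\(#.+\)|\L&|; s|%20|-|g'
}

out_same=$(printf '%s\n' "$same_file" | convert)
out_ext=$(printf '%s\n' "$external" | convert)
printf '%s\n%s\n' "$out_same" "$out_ext"
# [Some text](#header-linking-markdown.md)
# [NOT](https://example.com/URL-Should-Be-Untouched.html)
```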

[email protected]

Thank you, thank you very much for taking the time to help me out here ! I really appreciate your full breakdown and complete development ! I haven't tried it out yet, but skimming through your post I'm sure it will work out !

However, I forgot to mention something:

The goal of this expression is to find markdown links, and to ignore https links. In your post you indicate the markdown links all start with a # symbol, so we don’t have to explicitly ignore the https as much as we just have to match all links starting with #.

This is only true for links in the same file. If I link to another file it looks something like this:

[Why SVT-AV1 over AOM?](readme.md#Why%20SVT-AV1%20over%20AOM?)

I can try to wrap my head around it and find a solution by myself; with your well written breakdown I'm sure I can try something out. But if you think it will be too complex for my limited knowledge, feel free to adjust :).

Do you mind if I ping you if I'm not able to solve the issue?

Thanks again !!!!

[email protected]

Feel free to ping me if you want some help! I'd say I'm intermediate with regex, but I'm happy to help where I can.

Regarding the other file, you could pretty easily modify the command I gave you to adapt to the example you gave. There are two approaches you could take.

This is focused on the first regex in the command. The second should work unmodified on the other files if they follow the same pattern.

Here's the original chunk:

s|]\(#.+\)|\L&|

In the new example given, the # is preceded by readme.md. The easy modification is just to insert readme\.md before the # in the expression, adding the \ before the . to escape the metacharacter and match the actual period character, like so:

s|]\(readme\.md#.+\)|\L&|

However, if you have other files that have similar, but different patterns, like (faq.md#%20link%20text), and so on, you can make the expression more universal by using the .* metacharacter sequence. This is similar to the .+ metacharacter sequence, with one difference. The + indicates one or more times, while the * indicates zero or more times. So by using .* before the # you can likely use this on all the files if they follow the two pattern examples you gave.

If that will work, this would be the expression:

s|]\(.*#.+\)|\L&|

What this expression does is:

Find a closing bracket followed by an opening parenthesis followed by any sequence of characters (including no characters at all) until finding a pound/hash symbol, then find one or more characters until finding a closing parenthesis, and convert that entire matched string to lowercase.
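Here is a small sketch of the more universal expression run on both link styles, assuming GNU sed. The helper name convert is made up, and -i is omitted so the samples are only printed.

```shell
# Assumes GNU sed; -i omitted so the input is only printed, not edited.
# Hypothetical helper: the .* variant plus the unchanged %20 expression.
convert() {
  sed -r 's|]\(.*#.+\)|\L&|; s|%20|-|g'
}

other_file=$(printf '%s\n' '[Why SVT-AV1 over AOM?](readme.md#Why%20SVT-AV1%20over%20AOM?)' | convert)
same_file=$(printf '%s\n' '[Some text](#Header%20Linking%20MARKDOWN.md)' | convert)

printf '%s\n%s\n' "$other_file" "$same_file"
# [Why SVT-AV1 over AOM?](readme.md#why-svt-av1-over-aom?)
# [Some text](#header-linking-markdown.md)
```

Because .* also matches zero characters, the same expression covers links that start directly with #.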

[email protected]

Sorry to spam your unread message !

I played a bit around and came to the following conclusion:

s|]\(#.+\)|\L&| - Works great for in document links so I further expanded to this s|]\(#.+\)|\L&|;s|]\(.+#.+\)|\L&| to also add the following pattern [Some Text](readme.md#hello%20world.md)

s|%20|-|g - Works on every occurrence of %20 even for the following pattern [Some text](https://my/%20home%20page.com) which would break all external links to the web. So I used this /https/ ! s|%20|-|g

It's probably very sloppy what I'm doing and not as elegant as your command, but it does the trick. If you want to expand on it further, feel free. However, the following command does exactly what I wanted:

sed -re 's|]\(#.+\)|\L&|;s|]\(.+#.+\)|\L&|;/https/ ! s|%20|-|g'
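For anyone following along, here is a minimal sketch of what the /https/ ! part does, assuming GNU sed. /https/ is an address that selects lines containing https, and the ! negates it, so the substitution runs only on lines that do not contain https. The third sample line is made up to show that the whole line is skipped, not just the https link.

```shell
# Assumes GNU sed. /https/ selects lines containing "https";
# ! inverts that, so s|%20|-|g runs on every OTHER line.
input='[Some text](some%20text.md)
[Some text](https://my/%20home%20page.com)
See [docs](notes%20a.md) and https://example.com'

result=$(printf '%s\n' "$input" | sed -r '/https/ ! s|%20|-|g')
printf '%s\n' "$result"
# [Some text](some-text.md)
# [Some text](https://my/%20home%20page.com)
# See [docs](notes%20a.md) and https://example.com
```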

Thanks again from the bottom of my heart !

[email protected]

Nicely done! Happy I could help. I also did just reply with another modification if you want to take a look at how that one would work.

There's a million ways to do it and none are "right", so I wouldn't call yours sloppy at all. I'm still learning and have lots of slop in my own expressions.

I'll turn around and ask you a question if you don't mind. That last bit you used, I kind of understand what it's doing, but not fully. I'm getting that it's excluding https, but if you could explain the syntax I'd really appreciate it!

This is the bit:

/https/ ! s|%20|-|g

[email protected]

Haha we cross-replied !

.* did the trick and removes the need for the additional s|]\(.+#.+\) I added to include that pattern in my last reply !

Last question: /https/ ! s|%20|-|g changes all occurrences of %20 in the whole file, except on lines containing https. Is there any way to change only the occurrences that appear in the markdown link pattern ?

e.g. replace in [Some text](some%20text.md) but not in Hello I'm just some%20place holder text ?

Thanks again for your easy to read and very informative walk through ! 🤩

[email protected]

Sure

I don’t know if it's still a thing, but in the past some web URLs had spaces in their addresses e.g.

https://www.my/%20website%20with%20spaces.com

In markdown you can link to external web addresses like so

[some link to a web address](https://my/%20website%20with%20spaces.com)

However, /https/ ! s|%20|-|g replaces all occurrences of %20 (which is considered a space in HTML? Sorry if I’m wrong here :s still have a lot to learn) with -. This would break the link to the web URL [some link to a web address](https://my-website-with-spaces.com/). Am I wrong here?

If I may, I just found something else that doesn't quite work and it seems a bit harder to fix, I think ! Sometimes I have links in this form:

[1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles)

As you can see I prefix the header with 1.3 but, as dumb as it is... it also needs to be 1-3-subtitles

e.g.

[1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles)

Needs to become

[1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1-3-Subtitles)

Sorry for my bad English trying my best haha ! Hope it's comprehensible.

[email protected]

Quick question as I'm working on this, in the new link example, is the BDMV supposed to be converted to lowercase, or to remain uppercase?

[email protected]

As I see, you've already got an answer on how to convert text to lower case. So I'll just tell you how to replace all occurrences of %20 with -. You need to repeat the substitution until no matches are found. For such iteration you need to use branching to a label. Below is a sed script with comments (it uses extended regex syntax, so run it with sed -E or -r).

:subst # label
s/(\[[^]]+\]\([^)]*#[^)]*)%20([^)]*\))/\1-\2/ # replace the first occurrence of `%20` in the URL fragment
t subst # go to the 'subst' label if the substitution took place

However, there are some cases where this script will fail, e.g. if there is an escaped ] character in the link text. You cannot avoid such mistakes using only regexps; you need a full-featured markdown parser for this.
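The loop can be tried from the command line like this. It is only a sketch, assuming GNU sed: -E is needed because the script uses unescaped ( ) groups, and the script is split across -e options so the label ends cleanly. The three sample lines are made up for illustration.

```shell
# Assumes GNU sed. Sample lines are hypothetical.
input='[Some text](#Header%20Linking%20MARKDOWN.md)
Hello I am just some%20place holder text
[NOT](https://example.com/URL%20Should%20Be%20Untouched.html)'

# :subst is the label; t jumps back to it after every successful
# substitution, so one %20 is replaced per pass until none are left.
result=$(printf '%s\n' "$input" | sed -E \
  -e ':subst' \
  -e 's/(\[[^]]+\]\([^)]*#[^)]*)%20([^)]*\))/\1-\2/' \
  -e 't subst')

printf '%s\n' "$result"
# [Some text](#Header-Linking-MARKDOWN.md)
# Hello I am just some%20place holder text
# [NOT](https://example.com/URL%20Should%20Be%20Untouched.html)
```

The pattern requires a # inside the parentheses, which is why the plain text and the https link (no fragment) are left alone.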

[email protected]

Don't worry or apologize about your English. I'm having no trouble understanding.

I'm going to take the second part first and come back with another comment to address the %20 and https bits.

So these variations, like [1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles), are where you would start to craft a new expression. Trying to catch every variation in a single expression would get too complicated and be more likely to fail and/or modify text you don't want modified.

So in this case, here's the expression I'd use:

sed -ri 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\1-\2|' somefile

And the breakdown:

sed -ri calls sed with the extended regular expressions capabilities and tells it to edit the file in place

's| - Begins the pattern match|modify expression

( - This very first opening parenthesis is a special metacharacter that is used to group a sub-expression within the larger expression. By doing this we can create variables that we can refer to in the modification portion of the command.

]\( - Find the closing bracket character and an opening parenthesis character, which we know will be the beginning of a markdown url. The backslash precedes the open parenthesis to escape it and indicate it needs to look for the actual open parenthesis character

.+ - Find any character (indicated by the .) one or more times (indicated by the +). This will find any characters until it gets to the next specified character in the expression

[0-9]+ - This is two parts. The first part is [0-9]. The brackets are metacharacters in regex that enclose a character set to match from. In this case the character set is the numbers zero to nine. What this means on its own is that sed will look for one occurrence of any number between zero and nine. The + tells sed to find one or more occurrences of a number between zero and nine until it gets to the next portion of the pattern. I did this because I don't know the upper bounds of the documentation numeration you're working with in the links. If all the links only contain single digit numbers before the decimal, you can remove the +.

) - This closing parenthesis marks the end of the subexpression that we want to refer to. In this case, the sub expression is capturing from the closing bracket up to (but not including) the decimal in the number.

\. - This tells sed to find the period/dot/decimal character in the number. It's preceded by the backslash because the period/dot/decimal character is a metacharacter in regular expressions.

( - This is the beginning of a new subexpression

[0-9]+ - The numeral capture repeats to find the number after the period/dot/decimal. Similarly to the number before the decimal, if the number after the decimal is only ever single digit, the + can be removed.

.+ - Find any character (indicated by the .) one or more times (indicated by the +). This will find any characters until it gets to the next specified character in the expression, taking us to the end of the url

\) - Find the closing parenthesis of the url. The backslash precedes the closing parenthesis to escape it and indicate it needs to look for the actual closing parenthesis character.

) - This closes our second subexpression, which captures everything from the number after the decimal to the closing parentheses of the link.

| - Indicates the end of the pattern matching portion of the expression/command, and the beginning of the modification part of the command/expression.

\1 - This is how we refer to or call the subexpressions. The syntax is a backslash followed by a number, and the number indicates the sequential position of the subexpression. So \1 refers to this portion of the regex in the command above: (]\(.+[0-9]+). This section of the expression is capturing everything from the closing bracket up to (but not including) the period/dot/decimal character. By using it in this position in the substitution/modification, we're just using it as a variable, so in the substitution, it's going to put everything it finds in the first subexpression first in the new/modified string of text.

- - This tells sed to put a dash immediately after the first subexpression in this new/modified string of text, effectively replacing the period/dot/decimal in the number portion of the url.

\2 - This is calling the second subexpression, which is this portion of the pattern matching regex: [0-9]+.+\). This captures everything in the url from the number after the period/dot/decimal (not including the decimal), to the closing parenthesis of the markdown url. Used in this position of the substitution it tells sed to place it after the dash in the new/modified text.

|' - This indicates the end of the modification portion of the command and closes the match|substitution expression.

somefile - The file to be worked on

Here is the full command again: sed -ri 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\1-\2|' somefile

Altogether what this does is: Begin the first subexpression that starts with finding a closing bracket followed by an opening parentheses followed by any character one or more times until finding at least one or more numbers between zero and nine until it finds a decimal, and then close and remember what was found for this sub expression (not including the decimal). Then begin the second subexpression that starts with finding a number between zero and nine one or more times, and then find any character any number of times until a closing parentheses is found. Then close and remember what was found in this subexpression. Replace everything with subexpression one followed by a dash followed by subexpression two.

If you also need this markdown link text to be converted to lowercase, just add \L to the replacement section before the \1 like so:

sed -ri 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\L\1-\2|' somefile
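The expression can be checked against the example link without editing a file. This is a minimal sketch, assuming GNU sed with -i left out; chaining the s|%20|-|g step from earlier in the thread is just one way to get the fully dashed result.

```shell
# Assumes GNU sed; -i omitted so the sample is only printed.
link='[1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles)'

# The dot between the digits in the url becomes a dash.
dashed=$(printf '%s\n' "$link" | sed -r 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')

# Same, followed by the %20 replacement from earlier in the thread.
full=$(printf '%s\n' "$link" | sed -r 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\1-\2|; s|%20|-|g')

# \L variant: everything from "](" onward is lowercased as well.
lower=$(printf '%s\n' "$link" | sed -r 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\L\1-\2|')

printf '%s\n%s\n%s\n' "$dashed" "$full" "$lower"
# [1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1-3%20Subtitles)
# [1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1-3-Subtitles)
# [1.3 Subtitles](bdmv_svt-av1_encode_anime.md#1-3%20subtitles)
```

Note that the link text in the square brackets keeps its 1.3, because the match only starts at the closing bracket.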

[email protected]

NB: global substitution s///g is not applicable here because you need to perform new substitutions in a substituted text. Both sed regexp syntaxes (basic and extended) don't support lookarounds that could solve this issue.
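A small sketch of that point, assuming GNU sed: the same link-aware pattern with the g flag and no loop. One match covers the whole link, so after the first replacement there is nothing left on the line for g to match, and only one %20 is changed. The sample link is made up for illustration.

```shell
# Assumes GNU sed. Link-aware pattern with g, but WITHOUT the label/t loop.
link='[Some text](#Header%20Linking%20MARKDOWN.md)'

single_pass=$(printf '%s\n' "$link" \
  | sed -E 's/(\[[^]]+\]\([^)]*#[^)]*)%20([^)]*\))/\1-\2/g')

printf '%s\n' "$single_pass"
# [Some text](#Header%20Linking-MARKDOWN.md)
```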

[email protected]

Bad advice for sed. regex101 doesn't support POSIX regexes, so you are unable to get the same results as with sed.

[email protected]

This is very close

sed ':loop;/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g;t loop;/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g'

example file

[Some text](#Header%20Linking%20MARKDOWN.md)
(#Should%20stay%20as%20is.md)
Text surrounding [a link](readme.md#Other%20Page). Cool
Multiple [links](#Links.md) in (%20) [a](#An%20A.md) SINGLE [line](#Lines.md)
Do [NOT](https://example.com/URL%20Should%20Be%20Untouched.html) CHANGE%20 [hyperlinks](http://example.com/No%20Touchy.html)

but it doesn't work if you have a http link and markdown link in the same line, and doesn't work with [escaped \] square brackets](#and-escaped-\)-parenthesis) in the link

but!! it was fun!
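The command above can be tried on a couple of lines like this. It is only a sketch, assuming GNU sed; the variable names and the "mixed" line are made up to show the stated limitation, where a line holding both an http link and a markdown link is skipped entirely.

```shell
# Assumes GNU sed. The script is copied from the post above;
# the "mixed" sample line is hypothetical.
cmd=':loop;/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g;t loop;/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g'

plain='[Some text](#Header%20Linking%20MARKDOWN.md)'
mixed='[web](http://example.com/A%20B.html) and [local](#My%20Header.md)'

out_plain=$(printf '%s\n' "$plain" | sed "$cmd")
out_mixed=$(printf '%s\n' "$mixed" | sed "$cmd")

printf '%s\n%s\n' "$out_plain" "$out_mixed"
# [Some text](#header-linking-markdown.md)
# [web](http://example.com/A%20B.html) and [local](#My%20Header.md)
```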

agnos.is Forums
